VC6, VC2010, C# 2010, VB6 and MMX/SSE instruction-set performance shootout (converting 64-bit pixels to 32-bit pixels). Please help test

zyl910 2012-04-13 05:46:09

侯佩: Pinned! at 2012-04-23 12:40

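To optimize the conversion of 64-bit pixels to 32-bit pixels, I tested several algorithms and implemented each of them in VC6, VC2010, C# 2010 and VB6. I also compared them against MMX and SSE versions of the algorithm.

Full results:
http://blog.csdn.net/zyl910/article/details/7458757

Please help by running the test so we can see how it performs on other hardware.

Test program download:
http://files.cnblogs.com/zyl910/noif_Test.rar
http://dl.dbank.com/c069c6thd7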

54 replies


jizhan070202 2012-09-27


Impressive, learned a lot~~~~

为了神 2012-08-10


Too complicated for me, haha.

a765432101234567 2012-06-15


Pretty good, it works.

cscacn 2012-06-13


.B3.3:: ; Preds .B3.3 .B3.2
$LN509:
movsx r11d, WORD PTR [rdx+r8*8] ;67.3
$LN510:
test r11d, r11d ;67.3
$LN511:
cmovl r11d, r9d ;67.3
$LN512:
cmp r11d, 255 ;67.3
$LN513:
cmovge r11d, eax ;67.3
$LN514:
mov BYTE PTR [rcx+r8*4], r11b ;67.3
$LN515:
movsx r11d, WORD PTR [2+rdx+r8*8] ;67.3
$LN516:
test r11d, r11d ;67.3
$LN517:
cmovl r11d, r9d ;67.3
$LN518:
cmp r11d, 255 ;67.3
$LN519:
cmovge r11d, eax ;67.3
$LN520:
mov BYTE PTR [1+rcx+r8*4], r11b ;67.3
$LN521:
movsx r11d, WORD PTR [4+rdx+r8*8] ;67.3
$LN522:
test r11d, r11d ;67.3
$LN523:
cmovl r11d, r9d ;67.3
$LN524:
cmp r11d, 255 ;67.3
$LN525:
cmovge r11d, eax ;67.3
$LN526:
mov BYTE PTR [2+rcx+r8*4], r11b ;67.3
$LN527:
movsx r11d, WORD PTR [6+rdx+r8*8] ;67.3
$LN528:
test r11d, r11d ;67.3
$LN529:
cmovl r11d, r9d ;67.3
$LN530:
cmp r11d, 255 ;67.3
$LN531:
cmovge r11d, eax ;67.3
$LN532:
mov BYTE PTR [3+rcx+r8*4], r11b ;67.3
$LN533:
inc r8 ;64.2
$LN534:
cmp r8, r10 ;64.2
$LN535:
jb .B3.3 ; Prob 78% ;64.2
$LN536:
; LOE rdx rcx rbx rbp rsi rdi r8 r10 r12 r13 r14 r15 eax r9d xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B3.5:: ; Preds .B3.3 .B3.1

;;; pD[1] = min(max(0, pS[1]), 255);
;;; pD[2] = min(max(0, pS[2]), 255);
;;; pD[3] = min(max(0, pS[3]), 255);
;;; // next
;;; pS += 4;
;;; pD += 4;
;;; }
;;; }

$LN537:
ret ;75.1
ALIGN 16
$LN538:
; LOE
.B3.6::
$LN539:
; mark_end;
?f1_min@@YAXPEAEPEBFH@Z ENDP
$LN?f1_min@@YAXPEAEPEBFH@Z$540:
$LN?f1_min@@YAXPEAEPEBFH@Z$541:
;?f1_min@@YAXPEAEPEBFH@Z ENDS

zyl910 2012-05-13


Thread settled (points awarded).

yiao664 2012-05-11


Nice!! I'll take a look!!

zyl910 2012-05-11


Are there any new test results? I plan to settle the thread this Sunday, the day after tomorrow.

搬了20多年的砖 2012-05-09


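No need to read the code, it's garbage. I only looked at the conversion in post #5 and could tell you are still fairly junior, even though you have been programming for many years.
Can a conversion written like that ever be fast?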

zyl910 2012-05-05


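[Quote=Reply #40:]

I took a close look at the OP's program.
It seems to be an m*4 matrix with every element processed: values above 255 become 255, values below 0 become 0, and everything else is left unchanged?

Values that exceed the range of the integer type are saturated to that range.
Fractional values are rounded?

If that is the case……
[/Quote]

Right! Every entry of the m*4 matrix is saturated to the range [0,255].
To avoid timer-resolution problems, each test runs the loop 4000 times.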
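
As an illustration of the saturation just described, here is a minimal C sketch. The source of the tested functions is not shown at this point in the thread, so the names clamp_if, clamp_mask and saturate_pixels are hypothetical, and the mask trick is just one common branch-free formulation, not necessarily the one the OP benchmarked.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference version: clamp one signed 16-bit channel to [0, 255] using branches. */
static uint8_t clamp_if(int16_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Branch-free version (illustrative).
 * A comparison yields 0 or 1; negating it gives an all-zeros or all-ones mask. */
static uint8_t clamp_mask(int16_t v)
{
    int x = v;
    x &= -(x >= 0);    /* negative input -> 0, otherwise unchanged        */
    x |= -(x > 255);   /* input above 255 -> all ones, so low byte = 0xFF */
    return (uint8_t)x; /* keep only the low 8 bits                        */
}

/* Convert pixel_count pixels of 4 x int16 channels into 4 x uint8 channels. */
void saturate_pixels(uint8_t *dst, const int16_t *src, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count * 4; ++i)
        dst[i] = clamp_mask(src[i]);
}
```

Calling saturate_pixels(dst, src, n) on a buffer of n four-channel pixels gives the same bytes as applying clamp_if to each channel; whether the branch-free form is actually faster depends on the compiler and on how predictable the data is, which is exactly what the thread measures.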

I flipped through the IPP documentation, and the "Convert" family of functions appears to be responsible for this:

Intel® Integrated Performance Primitives Reference Manual, Volume 2: Image and Video Processing

4 Image Data Exchange and Initialization Functions

Convert

Converts image pixel values from one data type to another.

Syntax

Case 1: Conversion to increase bit depth and change signed to unsigned type

……

Case 2: Conversion to reduce bit depth and change unsigned to signed type:

integer to integer type

IppStatus ippiConvert_8u1u_C1R(const Ipp8u* pSrc, int srcStep, Ipp8u* pDst, int dstStep, int dstBitOffset, IppiSize roiSize, Ipp8u threshold);

IppStatus ippiConvert_<mod>(const Ipp<srcDatatype>* pSrc, int srcStep, Ipp<dstDatatype>* pDst, int dstStep, IppiSize roiSize);

Supported values for mod:

16u8u_C1R 16s8u_C1R 32s8u_C1R 32s8s_C1R
16u8u_C3R 16s8u_C3R 32s8u_C3R 32s8s_C3R
16u8u_C4R 16s8u_C4R 32s8u_C4R 32s8s_C4R
16u8u_AC4R 16s8u_AC4R 32s8u_AC4R 32s8s_AC4R

……

Description

The function ippiConvert is declared in the ippi.h file. It operates with ROI (see Regions of Interest in Intel IPP).

This function converts pixel values in the source image ROI pSrc to a different data type and writes them to the destination image ROI pDst.

The result of integer operations is always saturated to the destination data type range. It means that if the value of the source pixel is out of the data range of the destination image, the value of the corresponding destination pixel is set to the value of the lower or upper bound (minimum or maximum) of the destination data range:

x = pSrc[i,j]
if (x > MAX_VAL) x = MAX_VAL
if (x < MIN_VAL) x = MIN_VAL
pDst[i,j] = (CASTING)x

Everyone is welcome to give it a try.

zyl910 2012-05-05


[Quote=Reply #38:]

###########intel

== noif:VC2010(64) on 64bit ==<Press any key to continue>
f0_if[1]: 921
f0_if[2]: 903
f0_if[3]: 920
f1_min[1]: 277
f1_min[2]: 280
f1_min[3]: 290
f2_neg[1]: 457
f2_neg[2]: 447
f2_neg[3]: 446
f3_sar[1]: 367
f3_sar[2]: 388
f3_sar[3]: 378
f5_sse[1]: 25.5
f5_sse[2]: 25.2
f5_sse[3]: 25.4

[/Quote]

It seems that for hand-written SSE code there is little left for the compiler to optimize.
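
For context on why the f5_sse timings sit so far below the scalar ones, here is a hedged sketch of how such a conversion can be written with SSE2 intrinsics. The actual f5_sse source is not shown in this part of the thread, so the function name saturate_pixels_sse2 and the loop structure are assumptions; the key point is that _mm_packus_epi16 (the PACKUSWB instruction) converts signed 16-bit values to unsigned 8-bit values with saturation in a single instruction.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Saturate pixel_count pixels (4 x int16 each) to 4 x uint8 each.
 * Illustrative sketch: processes 16 channels (4 pixels) per iteration. */
void saturate_pixels_sse2(uint8_t *dst, const int16_t *src, size_t pixel_count)
{
    size_t n = pixel_count * 4; /* total number of channels */
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads, so no alignment is assumed for src or dst. */
        __m128i lo = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(src + i + 8));
        /* Pack 2 x 8 signed words into 16 unsigned bytes, clamping to [0, 255]. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }

    /* Scalar tail for the channels that do not fill a whole vector. */
    for (; i < n; ++i) {
        int16_t v = src[i];
        dst[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

Because the clamping is done by the pack instruction itself, there are no data-dependent branches and little for a compiler to improve, which is consistent with the nearly identical f5_sse numbers from both compilers above.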

CandPointer 2012-05-05


I took a close look at the OP's program.
It seems to be an m*4 matrix with every element processed: values above 255 become 255, values below 0 become 0, and everything else is left unchanged?

Values that exceed the range of the integer type are saturated to that range.
Fractional values are rounded?

If so, doing it directly as a matrix operation is very fast.

% MATLAB code
tic; % start timing
datSz = 16384; % data size, same as the OP's #define DATASIZE 16384
tDat = round(-128 + (383--128).*rand(datSz,4)); % random data in the range -128 to 383, following the OP
tDat(tDat<0)=0; % values below 0 become 0
tDat(tDat>255)=255; % values above 255 become 255
tDat= fix(tDat); % drop any fractional part; MATLAB data is double here, and it can be converted to int16 etc. if needed
toc % stop timing

Elapsed time is 0.007347 seconds. (about 0.007 s)

Finally, spot-check the data by sampling a few rows:

tDat(1:3000:end,:)

ans =

    20   255    93     0
    17     0   255     0
   183     0   236   158
   171   255     0   255
     0   171   102     0
   140    60   161   255

The above was written by hand.

I'm not familiar with MATLAB's image processing. There are probably dedicated functions that perform even better.

MATLAB can use IPP:

On Intel® architecture processors, the image arithmetic functions can take advantage of the Intel Integrated Performance Primitives (Intel IPP) library, thus accelerating their execution time. The Intel IPP library is only activated, however, when the data passed to these functions is of specific classes. See the reference pages for the individual arithmetic functions for more information.

The Integrated Performance Primitives (IPP) bundled with the Intel compiler cover image and video processing in Volume 2. There are all kinds of functions, and I don't know which one handles this task.

I'm not familiar with IPP's pile of functions; otherwise I could have tested the performance of Intel's IPP.

日立奔腾浪潮微软松下联想 2012-05-05


xor edx, edx
cmp dx, ax

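I can't believe VC2010 generates garbage code like this. Accessing the low 16 bits of a register right after modifying it adds at least one extra clock cycle of latency.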

CandPointer 2012-05-05


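[Quote=Reply #27:]

I improved noifVC2010 by adding MMX and SSE test functions.
Unfortunately, the program compiled with VC++ 2010 runs at about the same speed as the VC6 build; feel free to try other compilers.

The code is as follows:
C/C++ code
////////////////////////////////////////////////////////////
// noifVC2010.cpp : VC2010 saturation speed test
// Aut……
[/Quote]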





###########intel

== noif:VC2010(64) on 64bit ==<Press any key to continue>
f0_if[1]: 921
f0_if[2]: 903
f0_if[3]: 920
f1_min[1]: 277
f1_min[2]: 280
f1_min[3]: 290
f2_neg[1]: 457
f2_neg[2]: 447
f2_neg[3]: 446
f3_sar[1]: 367
f3_sar[2]: 388
f3_sar[3]: 378
f5_sse[1]: 25.5
f5_sse[2]: 25.2
f5_sse[3]: 25.4
<Press any key to exit>

##########VC 2010

== noif:VC2010(64) on 64bit ==<Press any key to continue>
f0_if[1]: 1465
f0_if[2]: 1439
f0_if[3]: 1453
f1_min[1]: 1583
f1_min[2]: 1554
f1_min[3]: 1580
f2_neg[1]: 379
f2_neg[2]: 378
f2_neg[3]: 380
f3_sar[1]: 244
f3_sar[2]: 252
f3_sar[3]: 250
f5_sse[1]: 26.0
f5_sse[2]: 25.8
f5_sse[3]: 25.3
<Press any key to exit>

CandPointer 2012-05-05


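The reason is that the CMOVcc family was first introduced with the P6-family processors and is not supported by every processor. For example, 286/386 processors do not support CMOVcc.

So, presumably to stay compatible with the 386, VS2010 avoids CMOVcc and emits various jumps instead.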

CandPointer 2012-05-05


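[Quote=Reply #35:]

Quoting reply #33:

Macro-fusion should only pack "test/cmp+Jcc" into a single micro-op. By itself it does nothing about handling branch mispredictions.

When branch prediction succeeds most of the time, macro-fusion reduces the number of micro-ops, so performance may be higher than with CMOVcc.
But when the prediction success rate is low (for example, when saturating data), frequent mispredictions seriously hurt pipeline performance. In that situation, branch-free code still keeps the pipeline fully loaded.

Today's CPUs have pipelines dozens of……
[/Quote]

The machine I tested on above has an i7 920, which is the Nehalem architecture; macro-fusion there apparently handles cmp + jle/jg...

The OP's test CPU is an i3-2310M, which should be the Sandy Bridge architecture; reportedly, besides test and cmp, add/sub can also serve as the first instruction of a macro-fused pair.

Even though the code VC generates can be macro-fused, its actual performance is still far behind.
So CMOVcc really does perform well, just as the optimization manual says.

I also tested the 32-bit build: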





----------------------
visual studio 2010

== noif:VC2010(32) on 64bit ==<Press any key to continue>
f0_if[1]: 1526
f0_if[2]: 1531
f0_if[3]: 1503
f1_min[1]: 2190
f1_min[2]: 2221
f1_min[3]: 2205
f2_neg[1]: 421
f2_neg[2]: 423
f2_neg[3]: 414
f3_sar[1]: 321
f3_sar[2]: 327
f3_sar[3]: 329
<Press any key to exit>

-----------------------
intel parallel studio 2011 xe

== noif:VC2010(32) on 64bit ==<Press any key to continue>
f0_if[1]: 900
f0_if[2]: 876
f0_if[3]: 874
f1_min[1]: 271
f1_min[2]: 270
f1_min[3]: 267
f2_neg[1]: 462
f2_neg[2]: 453
f2_neg[3]: 458
f3_sar[1]: 379
f3_sar[2]: 373
f3_sar[3]: 380
<Press any key to exit>



VC 2010







PUBLIC	?f1_min@@YAXPAEPBFH@Z				; f1_min

; Function compile flags: /Ogtp

;	COMDAT ?f1_min@@YAXPAEPBFH@Z

_TEXT	SEGMENT

_pbufD$ = 8						; size = 4

_pbufS$ = 12						; size = 4

_cnt$ = 16						; size = 4

?f1_min@@YAXPAEPBFH@Z PROC				; f1_min, COMDAT



; 60   : {



	push	ebp

	mov	ebp, esp



; 62   : 	BYTE* pD = pbufD;

; 63   : 	int i;

; 64   : 	for(i=0; i<cnt; ++i)



	mov	eax, DWORD PTR _cnt$[ebp]

	test	eax, eax

	jle	$LN1@f1_min



; 61   : 	const signed short* pS = pbufS;



	mov	ecx, DWORD PTR _pbufD$[ebp]

	push	esi

	mov	esi, DWORD PTR _pbufS$[ebp]

	push	edi

	add	ecx, 2

	add	esi, 4

	mov	edi, eax

	npad	2

$LL3@f1_min:



; 65   : 	{

; 66   : 		// saturate each of the 4 channels

; 67   : 		pD[0] = min(max(0, pS[0]), 255);



	movzx	eax, WORD PTR [esi-4]

	xor	edx, edx

	cmp	dx, ax

	jg	SHORT $LN32@f1_min

	cmp	ax, 255					; 000000ffH

	jge	SHORT $LN10@f1_min

$LN32@f1_min:

	xor	edx, edx

	cmp	dx, ax

	jle	SHORT $LN8@f1_min

	xor	eax, eax

	jmp	SHORT $LN11@f1_min

$LN8@f1_min:

	cwde

	jmp	SHORT $LN11@f1_min

$LN10@f1_min:

	mov	eax, 255				; 000000ffH

$LN11@f1_min:

	mov	BYTE PTR [ecx-2], al



; 68   : 		pD[1] = min(max(0, pS[1]), 255);



	movzx	eax, WORD PTR [esi-2]

	xor	edx, edx

	cmp	dx, ax

	jg	SHORT $LN33@f1_min

	cmp	ax, 255					; 000000ffH

	jge	SHORT $LN16@f1_min

$LN33@f1_min:

	xor	edx, edx

	cmp	dx, ax

	jle	SHORT $LN14@f1_min

	xor	eax, eax

	jmp	SHORT $LN17@f1_min

$LN14@f1_min:

	cwde

	jmp	SHORT $LN17@f1_min

$LN16@f1_min:

	mov	eax, 255				; 000000ffH

$LN17@f1_min:

	mov	BYTE PTR [ecx-1], al



; 69   : 		pD[2] = min(max(0, pS[2]), 255);



	movzx	eax, WORD PTR [esi]

	xor	edx, edx

	cmp	dx, ax

	jg	SHORT $LN34@f1_min

	cmp	ax, 255					; 000000ffH

	jge	SHORT $LN22@f1_min

$LN34@f1_min:

	xor	edx, edx

	cmp	dx, ax

	jle	SHORT $LN20@f1_min

	xor	eax, eax

	jmp	SHORT $LN23@f1_min

$LN20@f1_min:

	cwde

	jmp	SHORT $LN23@f1_min

$LN22@f1_min:

	mov	eax, 255				; 000000ffH

$LN23@f1_min:

	mov	BYTE PTR [ecx], al



; 70   : 		pD[3] = min(max(0, pS[3]), 255);



	movzx	eax, WORD PTR [esi+2]

	xor	edx, edx

	cmp	dx, ax

	jg	SHORT $LN35@f1_min

	cmp	ax, 255					; 000000ffH

	jge	SHORT $LN28@f1_min

$LN35@f1_min:

	xor	edx, edx

	cmp	dx, ax

	jle	SHORT $LN26@f1_min

	xor	eax, eax

	jmp	SHORT $LN29@f1_min

$LN26@f1_min:

	cwde

	jmp	SHORT $LN29@f1_min

$LN28@f1_min:

	mov	eax, 255				; 000000ffH

$LN29@f1_min:

	mov	BYTE PTR [ecx+1], al



; 71   : 		// next

; 72   : 		pS += 4;



	add	esi, 8



; 73   : 		pD += 4;



	add	ecx, 4

	dec	edi

	jne	$LL3@f1_min

	pop	edi

	pop	esi

$LN1@f1_min:



; 74   : 	}

; 75   : }



	pop	ebp

	ret	0

?f1_min@@YAXPAEPBFH@Z ENDP				; f1_min

_TEXT	ENDS

intel





_TEXT	SEGMENT  PARA PUBLIC FLAT  'CODE'

;	COMDAT ?f1_min@@YAXPAEPBFH@Z

TXTST2:

; -- Begin  ?f1_min@@YAXPAEPBFH@Z

; mark_begin;

       ALIGN     16

	PUBLIC ?f1_min@@YAXPAEPBFH@Z

?f1_min@@YAXPAEPBFH@Z	PROC NEAR 

; parameter 1(pbufD): 20 + esp

; parameter 2(pbufS): 24 + esp

; parameter 3(cnt): 28 + esp

.B3.1:                          ; Preds .B3.0



;;; {



$LN501:

        sub       esp, 16                                       ;60.1

$LN502:

        mov       edx, DWORD PTR [28+esp]                       ;59.6

$LN503:



;;; 	const signed short* pS = pbufS;

;;; 	BYTE* pD = pbufD;

;;; 	int i;

;;; 	for(i=0; i<cnt; ++i)



        test      edx, edx                                      ;64.2

$LN504:

        jle       .B3.5         ; Prob 10%                      ;64.2

$LN505:

                                ; LOE edx ebx ebp esi edi

.B3.2:                          ; Preds .B3.1

        xor       ecx, ecx                                      ;

        mov       DWORD PTR [12+esp], esi                       ;

        xor       eax, eax                                      ;

        mov       DWORD PTR [8+esp], edi                        ;

        mov       DWORD PTR [4+esp], ebx                        ;

        mov       DWORD PTR [esp], ebp                          ;

        mov       ebp, 255                                      ;

        mov       esi, DWORD PTR [24+esp]                       ;

        mov       edi, DWORD PTR [20+esp]                       ;

        ALIGN     16

$LN506:

                                ; LOE eax edx ecx ebp esi edi

.B3.3:                          ; Preds .B3.3 .B3.2

$LN507:



;;; 	{

;;; 		// saturate each of the 4 channels

;;; 		pD[0] = min(max(0, pS[0]), 255);



        movsx     ebx, WORD PTR [esi+ecx*8]                     ;67.3

$LN508:

        test      ebx, ebx                                      ;67.3

$LN509:

        cmovl     ebx, eax                                      ;67.3

$LN510:

        cmp       ebx, 255                                      ;67.3

$LN511:

        cmovge    ebx, ebp                                      ;67.3

$LN512:

        mov       BYTE PTR [edi+ecx*4], bl                      ;67.3

$LN513:

        movsx     ebx, WORD PTR [2+esi+ecx*8]                   ;67.3

$LN514:

        test      ebx, ebx                                      ;67.3

$LN515:

        cmovl     ebx, eax                                      ;67.3

$LN516:

        cmp       ebx, 255                                      ;67.3

$LN517:

        cmovge    ebx, ebp                                      ;67.3

$LN518:

        mov       BYTE PTR [1+edi+ecx*4], bl                    ;67.3

$LN519:

        movsx     ebx, WORD PTR [4+esi+ecx*8]                   ;67.3

$LN520:

        test      ebx, ebx                                      ;67.3

$LN521:

        cmovl     ebx, eax                                      ;67.3

$LN522:

        cmp       ebx, 255                                      ;67.3

$LN523:

        cmovge    ebx, ebp                                      ;67.3

$LN524:

        mov       BYTE PTR [2+edi+ecx*4], bl                    ;67.3

$LN525:

        movsx     ebx, WORD PTR [6+esi+ecx*8]                   ;67.3

$LN526:

        test      ebx, ebx                                      ;67.3

$LN527:

        cmovl     ebx, eax                                      ;67.3

$LN528:

        cmp       ebx, 255                                      ;67.3

$LN529:

        cmovge    ebx, ebp                                      ;67.3

$LN530:

        mov       BYTE PTR [3+edi+ecx*4], bl                    ;67.3

$LN531:

        inc       ecx                                           ;64.2

$LN532:

        cmp       ecx, edx                                      ;64.2

$LN533:

        jb        .B3.3         ; Prob 78%                      ;64.2

$LN534:

                                ; LOE eax edx ecx ebp esi edi

.B3.4:                          ; Preds .B3.3

        mov       esi, DWORD PTR [12+esp]                       ;

        mov       edi, DWORD PTR [8+esp]                        ;

        mov       ebx, DWORD PTR [4+esp]                        ;

        mov       ebp, DWORD PTR [esp]                          ;

$LN535:

                                ; LOE ebx ebp esi edi

.B3.5:                          ; Preds .B3.4 .B3.1



;;; 		pD[1] = min(max(0, pS[1]), 255);

;;; 		pD[2] = min(max(0, pS[2]), 255);

;;; 		pD[3] = min(max(0, pS[3]), 255);

;;; 		// next

;;; 		pS += 4;

;;; 		pD += 4;

;;; 	}

;;; }



$LN536:

        add       esp, 16                                       ;75.1

$LN537:

        ret                                                     ;75.1

        ALIGN     16

$LN538:

                                ; LOE

$LN539:

; mark_end;

?f1_min@@YAXPAEPBFH@Z ENDP

$LN?f1_min@@YAXPAEPBFH@Z$540:

$LN?f1_min@@YAXPAEPBFH@Z$541:

;?f1_min@@YAXPAEPBFH@Z	ENDS

_TEXT	ENDS

_DATA	SEGMENT  DWORD PUBLIC FLAT  'DATA'

_DATA	ENDS

; -- End  ?f1_min@@YAXPAEPBFH@Z

日立奔腾浪潮微软松下联想 2012-05-05


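[Quote=Reply #33:]

Macro-fusion should only pack "test/cmp+Jcc" into a single micro-op. By itself it does nothing about handling branch mispredictions.

When branch prediction succeeds most of the time, macro-fusion reduces the number of micro-ops, so performance may be higher than with CMOVcc.
But when the prediction success rate is low (for example, when saturating data), frequent mispredictions seriously hurt pipeline performance. In that situation, branch-free code still keeps the pipeline fully loaded.

Today's CPUs have pipelines dozens of stages deep, and the penalty for a branch misprediction is considerable……
[/Quote]

Macro-fusion is effectively the same as adding one more decoder. For example, processors of the Core architecture and later have 4 decoders per core and can decode at most 4 x86 instructions (not uops) per clock cycle; if two of them can be macro-fused, up to 5 x86 instructions can be decoded per cycle. It is like squeezing two people into one seat on a bus so that one more passenger fits, so performance naturally improves. Also, CMOVcc is a slow instruction (compared with test, cmp and mov).

"Pipelines dozens of stages deep" dates back to the Pentium 4 era (31 stages on the Pentium D). The Core architecture continues the Pentium III and Pentium M line and has only a 14-stage pipeline (the same goes for today's i7).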

zyl910 2012-05-04


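For this:
if ((t1|t2|t3)==0)
{
t4=1;
}

you can write branch-free code like this (note that it also sets t4 to 0 when the condition is false):
t4 = ((t1|t2|t3)==0)

If t4 must strictly stay unchanged when the condition is false, you can do this:
t4 ^= (t4^1)&(-((t1|t2|t3)==0))
Note: this uses XOR to swap "1" with the original t4.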
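
The XOR form above can be checked with a small self-contained C function. The function name set_if_all_zero and the choice of uint32_t are assumptions made for this sketch; the expression itself is the one from the post, with the negation written as an unsigned subtraction so the all-ones mask is well defined.

```c
#include <stdint.h>

/* Return 1 when t1, t2 and t3 are all zero; otherwise return t4 unchanged.
 * No branch is needed: the comparison result (0 or 1) is turned into a mask. */
static uint32_t set_if_all_zero(uint32_t t4, uint32_t t1, uint32_t t2, uint32_t t3)
{
    /* mask is 0xFFFFFFFF when the condition holds, 0 otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)((t1 | t2 | t3) == 0);
    /* t4 ^ (t4 ^ 1) == 1, so a full mask yields 1; an empty mask yields t4 */
    return t4 ^ ((t4 ^ 1u) & mask);
}
```

With the condition true, (t4 ^ 1) & mask equals t4 ^ 1, and XOR-ing that back into t4 leaves 1. With the condition false the mask is zero and t4 passes through untouched.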

zyl910 2012-05-04


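Macro-fusion should only pack "test/cmp+Jcc" into a single micro-op. By itself it does nothing about handling branch mispredictions.

When branch prediction succeeds most of the time, macro-fusion reduces the number of micro-ops, so performance may be higher than with CMOVcc.
But when the prediction success rate is low (for example, when saturating data), frequent mispredictions seriously hurt pipeline performance. In that situation, branch-free code still keeps the pipeline fully loaded.

Today's CPUs have pipelines dozens of stages deep, and the penalty for a branch misprediction is considerable.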

日立奔腾浪潮微软松下联想 2012-05-04