I learned something useful from this question. Let's first look at some scalar code
extern void foo2(int x, int y);
void foo(int x, int y) {
    if((x || y)!=0) foo2(x,y);
}
Compile it with gcc -O3 -S -masm=intel test.c
and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
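The compiler can get away with a single or here because, for integers, x || y is true exactly when x | y is nonzero. A minimal sketch of that equivalence (the helper names logical_or_nonzero and bitwise_or_nonzero are made up for illustration and are not part of the original code):

```c
#include <stdbool.h>

/* Hypothetical helper names, for illustration only. */

/* The test as written in foo(): logical OR of the two operands. */
bool logical_or_nonzero(int x, int y) {
    return (x || y) != 0;
}

/* What the generated assembly computes: bitwise OR, then a zero check. */
bool bitwise_or_nonzero(int x, int y) {
    return (x | y) != 0;
}
```

Both functions should agree for every pair of inputs, which is why the optimizer is free to pick the cheaper bitwise form.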
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
#include <smmintrin.h>
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x,y);
    if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c
and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question: your current solution is the best you can do.
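For reference, _mm_testz_si128(a, b) returns 1 when a & b is all zeros, so passing the same value twice turns it into a plain "is this register zero" test. A small sketch wrapping the SSE4.1 test above in a helper (the name any_bit_set_sse41 is hypothetical; the target attribute is a GCC/Clang extension used here on the assumption that -msse4.1 may not be on the command line):

```c
#include <emmintrin.h>  /* SSE2: _mm_or_si128 */
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 */

/* Hypothetical helper: returns 1 if any bit of x or y is set, else 0.
   The target attribute (GCC/Clang) enables SSE4.1 for this function only;
   the CPU must still support SSE4.1 at run time. */
__attribute__((target("sse4.1")))
int any_bit_set_sse41(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x, y);   /* por */
    return !_mm_testz_si128(z, z);    /* ptest sets ZF when (z & z) == 0 */
}
```

Because ptest with two different operands tests their AND, not their OR, the separate por is still needed here.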
However, if your processor does not support SSE4.1 you cannot use ptest. An alternative which only needs SSE2 is _mm_movemask_epi8. Keep in mind that it only looks at the most significant bit of each byte.
#include <emmintrin.h>
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than the SSE4.1 version with the ptest instruction.
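One caveat with the movemask test: pmovmskb gathers just the top bit of every byte, so it is exact when each byte of the inputs is either 0x00 or has its high bit set (for example, the output of an SSE compare), but it misses values such as 1. If the inputs can be arbitrary, an SSE2-only test would need a compare against zero first. A sketch under that assumption, with hypothetical helper names:

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: returns 1 if any bit of x or y is set, else 0. */
int any_bit_set_sse2(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x, y);
    /* Bytes of z that equal zero become 0xFF, all other bytes become 0x00. */
    __m128i is_zero = _mm_cmpeq_epi8(z, _mm_setzero_si128());
    /* One mask bit per byte; 0xFFFF means all 16 bytes of z were zero. */
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
}

/* The shorter test from the text: inspects only the top bit of each byte. */
int any_msb_set_sse2(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_or_si128(x, y)) != 0;
}
```

The exact version costs an extra pcmpeqb and a zeroed register, so whether it is worth it depends on what the inputs can contain.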
Until now I have been using the pmovmskb instruction because its latency is better than that of ptest on pre-Sandy Bridge processors. However, I reached that conclusion before Haswell. On Haswell the latency of pmovmskb is worse than the latency of ptest, and they both have the same throughput. But in this case that is not really important. What's important (which I did not realize before) is that pmovmskb does not set the FLAGS register and so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
#include <emmintrin.h>
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register this uses two scalar registers.
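This variant gives the same answer as taking the movemask of the OR, because the top bit of each OR-ed byte is just the OR of the two top bits. A short sketch to make that concrete (the names mask_of_or and or_of_masks are hypothetical):

```c
#include <emmintrin.h>  /* SSE2 */

/* Movemask of the OR: needs a temporary xmm register for x | y. */
int mask_of_or(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_or_si128(x, y));
}

/* OR of the two movemasks: needs two scalar registers instead. */
int or_of_masks(__m128i x, __m128i y) {
    return _mm_movemask_epi8(x) | _mm_movemask_epi8(y);
}
```

Both functions should return the same 16-bit mask for any inputs, so the choice between them is purely about register pressure and instruction scheduling.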
Note that fewer instructions do not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.