c++ - Add+Mul become slower with Intrinsics - where am I wrong?

Question

Welcome To Ask or Share your Answers For Others

c++ - Add+Mul become slower with Intrinsics - where am I wrong?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Add+Mul become slower with Intrinsics - where am I wrong?

Having this array:

alignas(16) double c[voiceSize][blockSize];

This is the function I'm trying to optimize:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
        pC[sampleIndex] = value + deltaValue * sampleIndex;
    }
}

And this is my intrinsics (SSE2) attempt:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        __m128d result_mul = _mm_setr_pd(sampleIndex, sampleIndex + 1);
        result_mul = _mm_mul_pd(result_mul, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }   
}

Which is slower than "scalar" (even if auto-optimized) original code, unfortunately :)

Where's the bottleneck in your opinion? Where am I wrong?

I'm using MSVC, Release/x86, /02 optimization flag (Favor fast code).

EDIT: doing this (suggested by @wim), it seems that performance become better than C version:

inline void Process(int voiceIndex, int blockSize) {    
    double *pC = c[voiceIndex];
    double value = start + step * delta;
    double deltaValue = rate * delta;

    __m128d value_add = _mm_set1_pd(value);
    __m128d deltaValue_mul = _mm_set1_pd(deltaValue);

    __m128d sampleIndex_acc = _mm_set_pd(-1.0, -2.0);
    __m128d sampleIndex_add = _mm_set1_pd(2.0);

    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2) {
        sampleIndex_acc = _mm_add_pd(sampleIndex_acc, sampleIndex_add);
        __m128d result_mul = _mm_mul_pd(sampleIndex_acc, deltaValue_mul);
        result_mul = _mm_add_pd(result_mul, value_add);

        _mm_store_pd(pC + sampleIndex, result_mul);
    }
}

Why? Is _mm_setr_pd expensive?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:43:06+0000

On my system, g++ test.cpp -march=native -O2 -c -o test

This will output for the normal version (loop body extract):

  30:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  34:   c5 fb 2a c0             vcvtsi2sd %eax,%xmm0,%xmm0
  38:   c4 e2 f1 99 c2          vfmadd132sd %xmm2,%xmm1,%xmm0
  3d:   c5 fb 11 04 c2          vmovsd %xmm0,(%rdx,%rax,8)
  42:   48 83 c0 01             add    $0x1,%rax
  46:   48 39 c8                cmp    %rcx,%rax
  49:   75 e5                   jne    30 <_Z11ProcessAutoii+0x30>

And for the intrinsics version:

  88:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  8c:   8d 50 01                lea    0x1(%rax),%edx
  8f:   c5 f1 57 c9             vxorpd %xmm1,%xmm1,%xmm1
  93:   c5 fb 2a c0             vcvtsi2sd %eax,%xmm0,%xmm0
  97:   c5 f3 2a ca             vcvtsi2sd %edx,%xmm1,%xmm1
  9b:   c5 f9 14 c1             vunpcklpd %xmm1,%xmm0,%xmm0
  9f:   c4 e2 e9 98 c3          vfmadd132pd %xmm3,%xmm2,%xmm0
  a4:   c5 f8 29 04 c1          vmovaps %xmm0,(%rcx,%rax,8)
  a9:   48 83 c0 02             add    $0x2,%rax
  ad:   48 39 f0                cmp    %rsi,%rax
  b0:   75 d6                   jne    88 <_Z11ProcessSSE2ii+0x38>

So in short: the compiler automatically generates AVX code from the C version.

Edit after playing a bit more with flags to have SSE2 only in both cases:

g++ test.cpp -msse2 -O2 -c -o test

The compiler still does something different from what you generate with intrinsics. Compiler version:

  30:   66 0f ef c0             pxor   %xmm0,%xmm0
  34:   f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  38:   f2 0f 59 c2             mulsd  %xmm2,%xmm0
  3c:   f2 0f 58 c1             addsd  %xmm1,%xmm0
  40:   f2 0f 11 04 c2          movsd  %xmm0,(%rdx,%rax,8)
  45:   48 83 c0 01             add    $0x1,%rax
  49:   48 39 c8                cmp    %rcx,%rax
  4c:   75 e2                   jne    30 <_Z11ProcessAutoii+0x30>

Intrinsics version:

  88:   66 0f ef c0             pxor   %xmm0,%xmm0
  8c:   8d 50 01                lea    0x1(%rax),%edx
  8f:   66 0f ef c9             pxor   %xmm1,%xmm1
  93:   f2 0f 2a c0             cvtsi2sd %eax,%xmm0
  97:   f2 0f 2a ca             cvtsi2sd %edx,%xmm1
  9b:   66 0f 14 c1             unpcklpd %xmm1,%xmm0
  9f:   66 0f 59 c3             mulpd  %xmm3,%xmm0
  a3:   66 0f 58 c2             addpd  %xmm2,%xmm0
  a7:   0f 29 04 c1             movaps %xmm0,(%rcx,%rax,8)
  ab:   48 83 c0 02             add    $0x2,%rax
  af:   48 39 f0                cmp    %rsi,%rax
  b2:   75 d4                   jne    88 <_Z11ProcessSSE2ii+0x38>

Compiler does not unroll the loop here. It might be better or worse depending on many things. You might want to bench both versions.

Categories

c++ - Add+Mul become slower with Intrinsics - where am I wrong?

c++ - Add+Mul become slower with Intrinsics - where am I wrong?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags