performance - libsvm compiled with AVX vs no AVX

Question

posted Oct 24, 2021 in Technique by 深蓝

I built a libsvm benchmarking app that calls svm_predict() 100 times on the same image using the same model. libsvm is compiled statically into the app (MSVC 2017) by including svm.cpp and svm.h directly in my project.

EDIT: adding benchmark details

for (int i = 0; i < counter; i++)
    {
        // Time only the svm_predict() call; printing happens outside the timed region.
        std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
        double label = svm_predict(model, input);
        std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();

        // Accumulate microseconds so the average can be computed after the loop.
        total_time += duration;

        // Note: sum is not defined in this excerpt.
        std::cout << "\n" << sum << " label:" << label << " duration:" << duration << "\n";
    }

This is the loop that I benchmark without any major modifications to the libsvm code.

After 100 runs, the average time per run is 4.7 ms, with no difference whether or not I enable AVX instructions. To make sure the compiler generates the expected instructions, I used the Intel Software Development Emulator to check the instruction mix:

with AVX:
*isa-ext-AVX                                                    36578280
*isa-ext-SSE                                                           4
*isa-ext-SSE2                                                          4
*isa-set-SSE                                                           4
*isa-set-SSE2                                                          4
*scalar-simd                                                    36568174
*sse-scalar                                                            4
*sse-packed                                                            4
*avx-scalar                                                     36568170
*avx128                                                             8363
*avx256                                                             1765

without AVX:
*isa-ext-SSE                                                       11781
*isa-ext-SSE2                                                   36574119
*isa-set-SSE                                                       11781
*isa-set-SSE2                                                   36574119
*scalar-simd                                                    36564559
*sse-scalar                                                     36564559
*sse-packed                                                        21341

I would expect some performance improvement. I know that avx128/256/512 are not used that much, but still. I have an i7-8550U CPU; do you think I would see a bigger difference if I ran the same test on a Skylake i9 X-series CPU?

EDIT: I added the instruction mix for each binary.

With AVX:

ADD                                                             16868725
AND                                                                   49
BT                                                                     6
CALL_NEAR                                                       14032515
CDQ                                                                    4
CDQE                                                                3601
CMOVLE                                                                 6
CMOVNZ                                                                 2
CMOVO                                                                 12
CMOVZ                                                                  6
CMP                                                             25417120
CMPXCHG_LOCK                                                           1
CPUID                                                                  3
CQO                                                                   12
DEC                                                                   68
DIV                                                                    1
IDIV                                                                  12
IMUL                                                                3621
INC                                                              8496372
JB                                                                   325
JBE                                                                    5
JL                                                                  7101
JLE                                                                38338
JMP                                                              8416984
JNB                                                                    6
JNBE                                                                   3
JNL                                                                  806
JNLE                                                                  61
JNS                                                                    1
JNZ                                                             22568320
JS                                                                     2
JZ                                                               8465164
LEA                                                             16829868
MOV                                                             42209230
MOVSD_XMM                                                              4
MOVSXD                                                              1141
MOVUPS                                                                 4
MOVZX                                                               3684
MUL                                                                   12
NEG                                                                   72
NOP                                                                 4219
NOT                                                                    1
OR                                                                    14
POP                                                                 1869
PUSH                                                                1870
REP_STOSD                                                              6
RET_NEAR                                                            1758
ROL                                                                    5
ROR                                                                   10
SAR                                                                    8
SBB                                                                    5
SETNZ                                                                  4
SETZ                                                                  26
SHL                                                                 1626
SHR                                                                  519
SUB                                                                 6530
TEST                                                             5616533
VADDPD                                                               594
VADDSD                                                           8445597
VCOMISD                                                                3
VCVTSI2SD                                                           3603
VEXTRACTF128                                                           6
VFMADD132SD                                                           12
VFMADD231SD                                                            6
VHADDPD                                                                6
VMOVAPD                                                               12
VMOVAPS                                                             2375
VMOVDQU                                                                1
VMOVSD                                                          11256384
VMOVUPD                                                              582
VMULPD                                                               582
VMULSD                                                           8451540
VPXOR                                                                  1
VSUBSD                                                           8407425
VUCOMISD                                                            3600
VXORPD                                                              2362
VXORPS                                                              3603
VZEROUPPER                                                             4
XCHG                                                                   8
XGETBV                                                                 1
XOR                                                              8414763
*total                                                         213991340

No AVX:
ADD                                                             16869910
ADDPD                                                               1176
ADDSD                                                            8445609
AND                                                                   49
BT                                                                     6
CALL_NEAR                                                       14032515
CDQ                                                                    4
CDQE                                                                3601
CMOVLE                                                                 6
CMOVNZ                                                                 2
CMOVO                                                                 12
CMOVZ                                                                  6
CMP                                                             25417408
CMPXCHG_LOCK                                                           1
COMISD                                                                 3
CPUID                                                                  3
CQO                                                                   12
CVTDQ2PD                                                            3603
DEC                                                                   68
DIV                                                                    1
IDIV                                                                  12
IMUL                                                                3621
INC                                                              8496369
JB                                                                   325
JBE                                                                    5
JL                                                                  7392
JLE                                                                38338
JMP                                                              8416984
JNB                                                                    6
JNBE                                                                   3
JNL                                                                  803
JNLE                                                                  61
JNS                                                                    1
JNZ                                                             22568317
JS                                                                     2
JZ                                                               8465164
LEA                                                             16829548
MOV                                                             42209235
MOVAPS

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:33:57+0000

Almost all of the arithmetic instructions you list work on scalars; for example, (V)SUBSD means SUBtract Scalar Double. The V prefix essentially just means that the AVX (VEX) encoding is used (this also zeroes the upper half of the full YMM register, which the legacy SSE encodings don't do). So, given the instructions you listed, there should be barely any runtime difference.

Modern x86 uses SSE1/2 or AVX for scalar FP math, using just the low element of XMM vector registers. It's somewhat better than x87 (more registers, and flat register set), but it's still only one result per instruction.
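To illustrate why a loop can end up as "one result per instruction", here is a minimal sketch, not libsvm's actual code. The roughly equal counts of VSUBSD, VMULSD and VADDSD in the question are consistent with a subtract, square, accumulate step per element, so the sketch uses a squared distance. The `Node` struct and both function names are hypothetical. The sparse version chooses which element to read next based on an index comparison, so compilers generally keep it scalar. The dense version has regular memory access and is the kind of loop a compiler may turn into packed instructions, typically only when relaxed floating-point settings (for example /fp:fast on MSVC or -ffast-math on GCC/Clang) allow it to reorder the sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sparse element: (index, value) pairs kept sorted by index.
struct Node {
    int index;
    double value;
};

// Squared Euclidean distance between two sparse vectors.
// The next element to read depends on comparing indices, so every
// iteration produces a single scalar result.
double sparse_sq_dist(const std::vector<Node>& x, const std::vector<Node>& y) {
    double sum = 0.0;
    std::size_t i = 0, j = 0;
    while (i < x.size() && j < y.size()) {
        if (x[i].index == y[j].index) {
            const double d = x[i].value - y[j].value;
            sum += d * d;
            ++i;
            ++j;
        } else if (x[i].index < y[j].index) {
            sum += x[i].value * x[i].value;  // index present only in x
            ++i;
        } else {
            sum += y[j].value * y[j].value;  // index present only in y
            ++j;
        }
    }
    // Leftover elements of whichever vector is longer.
    for (; i < x.size(); ++i) sum += x[i].value * x[i].value;
    for (; j < y.size(); ++j) sum += y[j].value * y[j].value;
    return sum;
}

// The same quantity over dense arrays: contiguous loads and no
// data-dependent branching, so it is a candidate for packed SIMD.
double dense_sq_dist(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double d = x[k] - y[k];
        sum += d * d;
    }
    return sum;
}
```

Both functions return the same value for equivalent inputs; only the shape of the loop differs, and that shape is what decides whether wider vectors can help.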

There are a few thousand packed SIMD instructions, vs. ~36 million scalar instructions, so only a relatively unimportant part of the code got auto-vectorized and could benefit from 256-bit vectors.
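The proportion can be worked out from the counts in the question's AVX build (avx128 = 8363, avx256 = 1765, avx-scalar = 36568170). The sketch below uses hypothetical helper names and makes the crude assumption that every instruction costs the same, which is not true on real hardware, so treat the result as an order-of-magnitude estimate only.

```cpp
#include <cassert>
#include <cstdint>

// Counts copied from the question's SDE output for the AVX build.
constexpr std::uint64_t kAvx128 = 8363;
constexpr std::uint64_t kAvx256 = 1765;
constexpr std::uint64_t kAvxScalar = 36568170;

// Fraction of the SIMD-register instructions that are packed.
double packed_share(std::uint64_t packed, std::uint64_t scalar) {
    assert(packed + scalar > 0);
    return static_cast<double>(packed) / static_cast<double>(packed + scalar);
}

// Amdahl-style upper bound on the speedup if the packed part became free,
// assuming (crudely) equal cost per instruction.
double best_case_speedup(double share) {
    assert(share >= 0.0 && share < 1.0);
    return 1.0 / (1.0 - share);
}
```

Calling `packed_share(kAvx128 + kAvx256, kAvxScalar)` gives roughly 0.00028, that is about 0.03% of these instructions, and `best_case_speedup` of that share stays below 1.001. Under these assumptions, even an ideal speedup of the vectorized part would be lost in measurement noise.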
