c - Code alignment in one object file is affecting the performance of a function in another object file

Question

Welcome To Ask or Share your Answers For Others

c - Code alignment in one object file is affecting the performance of a function in another object file

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c - Code alignment in one object file is affecting the performance of a function in another object file

I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell NASM inserts nop instructions to achieve code alignment.

Here is a function I have been trying this on a Ivy Bridge system

void triad(float *x, float *y, float *z, int n, int repeat) {
    float k = 3.14159f;
    int(int r=0; r<repeat; r++) {
        for(int i=0; i<n; i++) {
            z[i] = x[i] + k*y[i];
        }
    }
}

The assembly I'm using for this is below. If I don't specify the alignment my performance compared to the peak is only about 90%. However, when I align the code before the loop as well as both inner loops to 16 bytes the performance jumps to 96%. So clearly the code alignment in this case makes a difference.

But here is the strangest part. If I align the innermost loop to 32 bytes it makes no difference in the performance of this function, however, in another version of this function using intrinsics in a separate object file I link in its performance jumps from 90% to 95%!

I did an object dump (using objdump -d -M intel) of the version aligned to 16 bytes (I posted the result to the end of this question) and 32 bytes and they are identical! It turns out that the inner most loop is aligned to 32 bytes anyway in both object files. But there must be some difference.

I did a hex dump of each object file and there is one byte in the object files that differ. The object file aligned to 16 bytes has a byte with 0x10 and the object file aligned to 32 bytes has a byte with 0x20. What exactly is going on! Why does code alignment in one object file affect the performance of a function in another object file? How do I know what is the optimal value to align my code to?

My only guess is that when the code is relocated by the loader that the 32 byte aligned object file affects the other object file using intrinsics. You can find the code to test all this at Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%

The NASM code I am using:

global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
    triad_avx_asm_repeat:
    shl             rcx, 2  
    add             rdi, rcx
    add             rsi, rcx
    add             rdx, rcx
    vbroadcastss    ymm2, [rel pi]
    ;neg                rcx 

align 16
.L1:
    mov             rax, rcx
    neg             rax
align 16
.L2:
    vmulps          ymm1, ymm2, [rdi+rax]
    vaddps          ymm1, ymm1, [rsi+rax]
    vmovaps         [rdx+rax], ymm1
    add             rax, 32
    jne             .L2
    sub             r8d, 1
    jnz             .L1
    vzeroupper
    ret

Result from objdump -d -M intel test16.o. The disassembly is identical if I change align 16 to align 32 in the assembly above just before .L2. However, the object files still differ by one byte.

test16.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <pi>:
   0:   d0 0f                   ror    BYTE PTR [rdi],1
   2:   49                      rex.WB
   3:   40 90                   rex xchg eax,eax
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   90                      nop
   9:   90                      nop
   a:   90                      nop
   b:   90                      nop
   c:   90                      nop
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop

0000000000000010 <triad_avx_asm_repeat>:
  10:   48 c1 e1 02             shl    rcx,0x2
  14:   48 01 cf                add    rdi,rcx
  17:   48 01 ce                add    rsi,rcx
  1a:   48 01 ca                add    rdx,rcx
  1d:   c4 e2 7d 18 15 da ff    vbroadcastss ymm2,DWORD PTR [rip+0xffffffffffffffda]        # 0 <pi>
  24:   ff ff 
  26:   90                      nop
  27:   90                      nop
  28:   90                      nop
  29:   90                      nop
  2a:   90                      nop
  2b:   90                      nop
  2c:   90                      nop
  2d:   90                      nop
  2e:   90                      nop
  2f:   90                      nop

0000000000000030 <triad_avx_asm_repeat.L1>:
  30:   48 89 c8                mov    rax,rcx
  33:   48 f7 d8                neg    rax
  36:   90                      nop
  37:   90                      nop
  38:   90                      nop
  39:   90                      nop
  3a:   90                      nop
  3b:   90                      nop
  3c:   90                      nop
  3d:   90                      nop
  3e:   90                      nop
  3f:   90                      nop

0000000000000040 <triad_avx_asm_repeat.L2>:
  40:   c5 ec 59 0c 07          vmulps ymm1,ymm2,YMMWORD PTR [rdi+rax*1]
  45:   c5 f4 58 0c 06          vaddps ymm1,ymm1,YMMWORD PTR [rsi+rax*1]
  4a:   c5 fc 29 0c 02          vmovaps YMMWORD PTR [rdx+rax*1],ymm1
  4f:   48 83 c0 20             add    rax,0x20
  53:   75 eb                   jne    40 <triad_avx_asm_repeat.L2>
  55:   41 83 e8 01             sub    r8d,0x1
  59:   75 d5                   jne    30 <triad_avx_asm_repeat.L1>
  5b:   c5 f8 77                vzeroupper 
  5e:   c3                      ret    
  5f:   90                      nop

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:09:03+0000

Ahhh, code alignment...

Some basics of code alignment..

Most intel architectures fetch 16B worth of instructions per clock.
The branch predictor has a larger window and looks at typically double that, per clock. The idea is to get ahead of the instructions fetched.
How your code is aligned will dictate which instructions you have available to decode and predict at any given clock (simple code locality argument).
Most modern intel architectures cache instructions at various levels (either at the macro instructions level before decoding, or at the micro instruction level after decoding). This eliminates the effects of code alignment, as long as you executing out of the micro/macro cache.
Also, most modern intel architectures have some form of loop stream detector that detects loops, again, executing them out of some cache that bypasses the front end fetch mechanism.
Some intel architectures are finicky about what they can cache, and what they can't. There are often dependencies on number of instructions/uops/alignment/branches/etc. Alignment may, in some cases, affect what's cached and what's not, and you can create cases where padding can prevent or cause a loop to get cached.
To make things even more complicated, the addresses of instructions are also use by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value to maintain some form of global state of branch behavior for prediction purposes, (3) as a key into determining indirect branch targets, etc.. Therefore, alignment can actually have a pretty huge impact on branch prediction, in some case, due to aliasing or other poor prediction.
Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that, if just the right conditions exist.
Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop).

Having said all that blah blah, your issue could be one of any of these. It's important to look at the disassembly of not just the object, but the executable. You want to see what the final addresses are after everything is linked. Making changes in one object, could affect the alignment/addresses of instructions in another object after linking.

In some cases, it's near impossible to align your code in such a way as to maximize performance, simply due to so many low level architectural behaviors being hard to control and predict (that doesn't necessarily mean this is always the case). In some cases, your best bet is to have some default alignment strategy (say align all entries on 16B boundaries, and outer loops the same) so as you minimize the amount your performance varies from change-to-change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.

Beyond that, I'd need more info/data to pinpoint your exact problem, but thought some of this may help.. Good luck :)

Categories

c - Code alignment in one object file is affecting the performance of a function in another object file

c - Code alignment in one object file is affecting the performance of a function in another object file

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags