
gcc - Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

# b is a 64-byte aligned array of double
__m256d b0 = _mm256_broadcast_sd(&b[4*k+0]);
__m256d b1 = _mm256_broadcast_sd(&b[4*k+1]);
__m256d b2 = _mm256_broadcast_sd(&b[4*k+2]);
__m256d b3 = _mm256_broadcast_sd(&b[4*k+3]);

I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggest that the CPU actually issues 4 separate loads from the L1 cache. Although at this point I'm not limited by L1 latency or bandwidth, I'm very interested to know whether there is a way to load the 4 doubles with one 256-bit load (or two 128-bit loads) and shuffle them into 4 YMM registers. I looked through the Intel Intrinsics Guide but couldn't find a way to accomplish the required shuffling. Is that possible?

(If the premise that the CPU doesn't combine the 4 consecutive loads is actually wrong, please let me know.)


1 Reply


TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). This is very good on Haswell and later, where vbroadcastsd ymm,[mem] doesn't require an ALU shuffle operation, and it's usually also the best option for Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).

It also means you don't need to care about alignment at all, beyond natural alignment for a double.

The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.


(It's usually best to use set1(x), not _mm256_broadcast_sd(&x); if the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose between a store -> broadcast-load or two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
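
For example, a minimal sketch of the question's loads rewritten with set1 (assuming the same b and k as in the question):

__m256d b0 = _mm256_set1_pd(b[4*k+0]);   // vbroadcastsd ymm,[mem]: 1 uop on Haswell+
__m256d b1 = _mm256_set1_pd(b[4*k+1]);
__m256d b2 = _mm256_set1_pd(b[4*k+2]);
__m256d b3 = _mm256_set1_pd(b[4*k+3]);

With -mavx512vl, a compiler can often fold such a set1 load straight into the memory operand of a later ALU instruction (e.g. an FMA) using AVX512's embedded broadcast, instead of materializing it in a register first.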

If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
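
In intrinsics, that looks something like this (a sketch, assuming the b / k indexing from the question; note that _mm256_broadcast_pd takes a __m128d pointer):

// one 16B broadcast-load puts [b0 b1] in both lanes
__m256d pair01 = _mm256_broadcast_pd((const __m128d*)&b[4*k]);  // vbroadcastf128
__m256d b0 = _mm256_unpacklo_pd(pair01, pair01);   // [b0 b0 | b0 b0]
__m256d b1 = _mm256_unpackhi_pd(pair01, pair01);   // [b1 b1 | b1 b1]
// repeat with &b[4*k+2] for b2 / b3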

Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
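
A sketch of that with AVX2 intrinsics (same assumed b and k; compilers should turn the element-0 case into vbroadcastsd ymm, xmm, as in the Haswell listing further down):

__m256d v  = _mm256_loadu_pd(&b[4*k]);        // one 32B load: [b0 b1 | b2 b3]
__m256d b0 = _mm256_permute4x64_pd(v, 0x00);  // all elements = v[0]
__m256d b1 = _mm256_permute4x64_pd(v, 0x55);  // all elements = v[1]
__m256d b2 = _mm256_permute4x64_pd(v, 0xAA);  // all elements = v[2]
__m256d b3 = _mm256_permute4x64_pd(v, 0xFF);  // all elements = v[3]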


Source: Agner Fog's instruction tables (and microarch PDF):

Intel Haswell and later:

vbroadcastsd ymm,[mem] and other broadcast-load insns are single-uop instructions that are handled entirely by a load port (the broadcast happens "for free").

The total cost of doing four broadcast-loads this way is 4 instructions. Fused domain: 4 uops. Unfused domain: 4 uops for p2/p3. Throughput: two vectors per cycle.

Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.

Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:

## Haswell maximum broadcast throughput with AVX1
vbroadcastsd    ymm0, [rsi]
vbroadcastsd    ymm1, [rsi+8]
vbroadcastf128  ymm2, [rsi+16]     # p23 only on Haswell, also p5 on SnB/IvB
vunpckhpd       ymm3, ymm2,ymm2
vunpcklpd       ymm2, ymm2,ymm2
vbroadcastsd    ymm4, [rsi+32]     # or vaddpd ymm0, [rdx+something]
#add             rsi, 40

Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.

AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.

## Haswell maximum broadcast throughput with AVX2
vmovups         ymm3, [rsi]
vbroadcastsd    ymm0, xmm3         # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
vpermpd         ymm1, ymm3, 0b01_01_01_01   # NASM syntax for 0x55
vpermpd         ymm2, ymm3, 0b10_10_10_10
vpermpd         ymm3, ymm3, 0b11_11_11_11
vbroadcastsd    ymm4, [rsi+32]
vbroadcastsd    ymm5, [rsi+40]
vbroadcastsd    ymm6, [rsi+48]
vbroadcastsd    ymm7, [rsi+56]
vbroadcastsd    ymm8, [rsi+64]
vbroadcastsd    ymm9, [rsi+72]
vbroadcastsd    ymm10,[rsi+80]    # or vaddpd ymm0, [rdx + whatever]
#add             rsi, 88

AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused domain: 12 uops, leaving 4 uops per 4 cycles beyond this.

I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.


Intel SnB/IvB:

vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).

vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only instructions. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits on unaligned 32B loads, though, which is why gcc defaults to movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.

4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (which are also needed for store-address generation). A bank conflict only happens when the same bank of two different cache lines is accessed in the same cycle.

Multiple 2-uop instructions with no other instructions between them are the worst case for the decoders when the uop cache is cold, but a good compiler would mix single-uop instructions in between them.

SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.

## Sandybridge maximum broadcast throughput: AVX1
vbroadcastsd    ymm0, [rsi]
add             rsi, 8

One per clock, saturating p5 but using only half the capacity of p23.

We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:

vbroadcastf128  ymm2, [rsi+16]     # 2 uops: p23 + p5 on SnB/IvB
vunpckhpd       ymm3, ymm2,ymm2    # 1 uop: p5
vunpcklpd       ymm2, ymm2,ymm2    # 1 uop: p5

Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.
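
A sketch of that shape in intrinsics (same assumed b and k; two lane-duplicating vperm2f128 shuffles feed four in-lane unpacks):

__m256d v  = _mm256_loadu_pd(&b[4*k]);            // [b0 b1 | b2 b3]
__m256d lo = _mm256_permute2f128_pd(v, v, 0x00);  // [b0 b1 | b0 b1]
__m256d hi = _mm256_permute2f128_pd(v, v, 0x11);  // [b2 b3 | b2 b3]
__m256d b0 = _mm256_unpacklo_pd(lo, lo);          // [b0 b0 | b0 b0]
__m256d b1 = _mm256_unpackhi_pd(lo, lo);          // [b1 b1 | b1 b1]
__m256d b2 = _mm256_unpacklo_pd(hi, hi);          // [b2 b2 | b2 b2]
__m256d b3 = _mm256_unpackhi_pd(hi, hi);          // [b3 b3 | b3 b3]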

