performance - Any way to move 2 bytes in 32-bit x86 using MOV without causing a mode switch or cpu stall?

If I want to move 2 unsigned bytes from memory into a 32-bit register, can I do that with a MOV instruction and no mode switch?

I notice that you CAN do that with the MOVSX and MOVZX instructions. For example, MOVZX with the encoding 0F B7 moves 16 bits into a 32-bit register, zero-extending. It is a 3-cycle instruction, though.

Alternatively, I guess I could move 4 bytes into the register and then somehow CMP just two of them.

What is the fastest strategy for retrieving and comparing 16-bit data on 32-bit x86? Note that I am mostly doing 32-bit operations so I can't switch to 16-bit mode and stay there.


FYI to the uninitiated: the issue here is that 32-bit Intel x86 processors can MOV 8-bit data, and either 16-bit or 32-bit data, depending on the mode they are in. This default is selected by the "D bit" in the code-segment descriptor. You can use the special prefixes 0x66 (operand size) and 0x67 (address size) to get the non-default size. For example, if you are in 32-bit mode and you prefix an instruction with 0x66, the operand will be treated as 16-bit. The only problem is that doing this causes a big performance hit.
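(An illustration of my own, not part of the original question; the bytes shown are one valid 32-bit-mode encoding of each instruction:)

cmp   eax, ebx             ; 39 D8: default 32-bit operand size
cmp   ax, bx               ; 66 39 D8: same opcode, the 0x66 prefix selects 16-bit operands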


1 Reply

Use movzx to load narrow data on modern CPUs. (Or movsx if it's useful to have it sign-extended instead of zero-extended, but movzx is sometimes faster and never slower.)


movzx is only slow on the ancient P5 (original Pentium) microarchitecture, not anything made this century. Pentium-branded CPUs based on recent microarchitectures, like the Pentium G3258 (Haswell, the 20th-anniversary edition of the original Pentium), are totally different beasts and perform like the equivalent i3, but without AVX, BMI1/2, or hyperthreading.

Don't tune modern code based on P5 guidelines / numbers. However, Knight's Corner (Xeon Phi) is based on a modified P54C microarchitecture, so perhaps it has slow movzx as well. Neither Agner Fog nor Instlatx64 have per-instruction throughput / latency numbers for KNC.


Using a 16-bit operand size instruction doesn't switch the whole pipeline over to 16-bit mode or cause a big perf hit. See Agner Fog's microarch pdf to learn exactly what is and isn't slow on various x86 CPU microarchitectures (including ones as old as Intel P5 (original Pentium) which you seem to be talking about for some reason).

Writing a 16-bit register and then reading the full 32/64-bit register is slow on some CPUs (a partial-register stall when merging on Intel P6-family). On others, writing a 16-bit register merges into the old value, so there's a false dependency on the old value of the full register when you write, even if you never read the full register; which CPU does what varies. (Note that Haswell/Skylake only rename AH separately, unlike Sandybridge, which (like Core2/Nehalem) also renames AL / AX separately from RAX, but merges without stalling.)
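For concreteness, a minimal sketch of the pattern to avoid versus the recommended one (my example, reusing the src1 label from the code below):

mov    ax, word [src1]     ; writes only AX: the old upper bytes of EAX still matter
add    edx, eax            ; reading EAX afterwards costs a partial-register merge (a stall on P6-family)

movzx  eax, word [src1]    ; writes the whole register, so reading EAX later is free
add    edx, eax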


Unless you specifically care about in-order P5 (or possibly Knight's Corner Xeon Phi, based on the same core, but IDK if movzx is slow there, too), USE THIS:

movzx  eax, word [src1]    ; as efficient as a 32-bit MOV load on most CPUs
cmp    ax, word [src2]

The operand-size prefix for cmp decodes efficiently on all modern CPUs. Reading a 16-bit register after writing the full register is always fine, and the 16-bit load for the other operand is also fine.

The operand-size prefix isn't length-changing here because there's no imm16 / imm32. e.g. cmp word [src2], 0x7F is fine (it can use a sign-extended imm8), but cmp word [src2], 0x80 needs an imm16 and will LCP-stall in the decoders of some Intel CPUs. (Without the operand-size prefix, the same opcode would take an imm32, i.e. the rest of the instruction would be a different length.) Instead, use mov eax, 0x80 / cmp word [src2], ax.
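Laid out as code (my arrangement of the same instructions, nothing new):

cmp   word [src2], 0x7F    ; fine: fits in a sign-extended imm8, prefix is not length-changing
cmp   word [src2], 0x80    ; needs an imm16 after the 0x66 prefix: LCP stall in some Intel decoders

mov   eax, 0x80            ; workaround: materialize the constant in a register first
cmp   word [src2], ax      ; no immediate at all, so no length-changing prefix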

The address-size prefix can be length-changing in 32-bit mode (disp32 vs. disp16), but we don't want to use 16-bit addressing modes to access 16-bit data. We're still using [ebx+1234] (or rbx), not [bx+1234].
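(A small illustration of that last point, mine rather than the original answer's:)

cmp   ax, [ebx+1234]       ; 32-bit addressing mode: disp32, only the 0x66 operand-size prefix
;cmp  ax, [bx+1234]        ; 16-bit addressing mode: would also need the 0x67 address-size prefix and a disp16 -- don't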


On modern x86 (Intel P6 / SnB-family / Atom / Silvermont, and AMD since at least K7; i.e. anything made in this century, newer than the actual P5 Pentium), movzx loads are very efficient.

On many CPUs, the load ports directly support movzx (and sometimes also movsx), so it runs as just a load uop, not as a load + ALU.

Data from Agner Fog's instruction-set tables: Note they may not cover every corner case, e.g. mov-load numbers might only be for 32 / 64-bit loads. Also note that Agner Fog's load latency numbers are not load-use latency from L1D cache; they only make sense as part of the store/reload (store-forwarding) latency, but relative numbers will tell us how many cycles movzx adds on top of mov (often no extra cycles).

(Update: https://uops.info/ has better test results that actually reflect load-use latency, and they're automated so typos and clerical errors in updating the spreadsheets aren't a problem. But uops.info only goes back to Conroe (first-gen Core 2) for Intel, and only Zen for AMD.)

  • P5 Pentium (in-order execution): movzx-load is a 3-cycle instruction (plus a decode bottleneck from the 0F prefix), vs. mov-loads being single cycle throughput. (They still have latency, though).

  • Intel:

      • PPro / Pentium II / III: movzx/movsx run on just a load port, same throughput as plain mov.

      • Core2 / Nehalem: same, including 64-bit movsxd, except on Core 2 where a movsxd r64, m32 load costs a load + ALU uop, which don't micro-fuse.

      • Sandybridge-family (SnB through Skylake and later): movzx/movsx loads are single-uop (just a load port), and perform identically to mov loads.

      • Pentium4 (netburst): movzx runs on the load port only, same perf as mov. movsx is load + ALU, and takes 1 extra cycle.

      • Atom (in-order): Agner's table is unclear for memory-source movzx/movsx needing an ALU, but they're definitely fast. The latency number is only for reg,reg.

      • Silvermont: same as Atom: fast but unclear on needing a port.

      • KNL (based on Silvermont): Agner lists movzx/movsx with a memory source as using IP0 (ALU), but latency is the same as mov r,m so there's no penalty. (execution-unit pressure is not a problem because KNL's decoders can barely keep its 2 ALUs fed anyway.)

  • AMD:

      • Bobcat: movzx/movsx loads are 1 per clock, 5 cycle latency. mov-load is 4c latency.

      • Jaguar: movzx/movsx loads are 1 per clock, 4 cycle latency. mov loads are 1 per clock, 3c latency for 32/64-bit, or 4c for mov r8/r16, m (but still only an AGU port, not an ALU merge like Haswell/Skylake do).

      • K7/K8/K10: movzx/movsx loads have 2-per-clock throughput, latency 1 cycle higher than a mov load. They use an AGU and an ALU.

      • Bulldozer-family: same as K10, but movsx-load has 5 cycle latency. movzx-load has 4 cycle latency, mov-load has 3 cycle latency. So in theory it might be lower latency to mov cx, word [mem] and then movsx eax, cx (1 cycle), if the false dependency from a 16-bit mov load doesn't require an extra ALU merge, or create a loop-carried dependency for your loop. (See the sketch after this list.)

      • Ryzen: movzx/movsx loads run in the load port only, same latency as mov loads.

  • VIA:

      • Via Nano 2000/3000: movzx runs on the load port only, same latency as mov loads. movsx is LD + ALU, with 1c extra latency.
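That Bulldozer-family idea as a sketch (speculative, based only on the latency numbers above; mem is a placeholder label):

mov    cx, word [mem]      ; 3c mov-load, but only writes CX, so it carries a false dependency on the old RCX
movsx  eax, cx             ; 1c register-to-register sign extension: 4c total, in theory

;movsx eax, word [mem]     ; the simple form: 5c movsx-load on Bulldozer-family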

When I say "perform identically", I mean not counting any partial-register penalties or cache-line splits from a wider load. e.g. a movzx eax, word [rsi] avoids a merging penalty vs mov ax, word [rsi] on Skylake, but I'll still say that mov performs identically to movzx. (I guess I mean that mov eax, dword [rsi] without any cache-line splits is as fast as movzx eax, word [rsi].)


xor-zeroing the full register before writing a 16-bit register avoids a later partial-register merging stall on Intel P6-family, as well as breaking false dependencies.

If you want to run well on P5 as well, this might be somewhat better there, while not being much worse on any modern CPU. The exception is PPro through PIII, where xor-zeroing isn't dependency-breaking, although it is still recognized as a zeroing idiom that makes EAX equivalent to AX (so there's no partial-register stall when reading EAX after writing AL or AX).

;; Probably not a good idea, maybe not faster on anything.

;mov  eax, 0             ; some code tuned for PIII used *both* this and xor-zeroing.
xor   eax, eax           ; *not* dep-breaking on early P6 (up to PIII)
mov    ax, word [src1]
cmp    ax, word [src2]

; safe to read EAX without partial-reg stalls

The operand-size prefix isn't ideal for P5, so you could consider using a 32-bit load if you're sure it doesn't fault, cross a cache-line boundary, or cause a store-forwarding failure from a recent 16-bit store.
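A sketch of that variant (assuming, and this is my assumption, that reading 2 bytes past src1 is known to be safe):

mov   eax, dword [src1]    ; plain 32-bit load: single cycle and pairable on P5, no prefixes
cmp   ax, word [src2]      ; the 16-bit compare still pays the operand-size-prefix decode cost on P1/PMMX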

Actually, I think a 16-bit mov load might be slower on Pentium than the movzx/cmp 2 instruction sequence. There really doesn't seem to be a good option for working with 16-bit data as efficiently as 32-bit! (Other than packed MMX stuff, of course).

See Agner Fog's guide for the Pentium details, but the operand-size prefix takes an extra 2 cycles to decode on P1 (original P5) and PMMX, so this sequence may actually be worse than a movzx load. On P1 (but not PMMX), the 0F escape byte (used by movzx) also counts as a prefix, taking an extra cycle to decode.

Apparently movzx isn't pairable anyway. Multi-cycle movzx will hide the decode latency of cmp ax, [src2], so movzx / cmp is probably still the best choice. Or schedule instructions so the movzx is done earlier and the cmp can maybe pair with something. Anyway, the scheduling rules are quite complicated for P1/PMMX.


I timed this loop on Core2 (Conroe) to prove that xor-zeroing avoids partial register stalls for 16-bit registers as well as low-8 (like for setcc al):

mov     ebp, 100000000
ALIGN 32
.loop:
%rep 4
    xor   eax, eax
;    mov   eax, 1234    ; just break dep on the old value, not a zeroing idiom
    mov   ax, cx        ; write AX
    mov   edx, eax      ; read EAX
%endrep

    dec   ebp           ; Core2 can't fuse dec / jcc even in 32-bit mode
    jg   .loop          ; but SnB does

perf stat -r4 ./testloop output for this in a static binary that makes a sys_exit system call after the loop:

 ;; Core2 (Conroe) with   XOR eax, eax
       469,277,071      cycles                    #    2.396 GHz
     1,400,878,601      instructions              #    2.98  insns per cycle
       100,156,594      branches 
