Swapping 2 registers in 8086 assembly language (16 bits)

Question

Swapping 2 registers in 8086 assembly language (16 bits)

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:10:09+0000

8086 has an instruction for this:

xchg   ax, bx

If you really need to swap two regs, xchg ax, bx is in most cases the most efficient way on all x86 CPUs, modern and ancient, including the 8086. (You could construct a case where multiple single-uop instructions are more efficient because of some other weird front-end effect from the surrounding code. And for 32-bit operand size, zero-latency mov can make a 3-mov sequence with a temporary register better on Intel CPUs.)

For code size, xchg-with-ax takes only a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode¹. Exchanging any other pair of registers takes 2 bytes with the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
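As a rough illustration of those sizes, here is how the forms would assemble in 16-bit code. This is a sketch in NASM syntax; the exact ModRM byte chosen for the two-byte form depends on the assembler, so it is not spelled out.

```nasm
bits 16

xchg  ax, bx      ; 1 byte:  93h (short form 90h + register number, BX = 3)
xchg  cx, dx      ; 2 bytes: opcode 87h followed by a ModRM byte
nop               ; 1 byte:  90h, i.e. the encoding of xchg ax, ax
```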

On an actual 8086, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.

Footnote 1: In 64-bit mode, xchg eax, eax would truncate RAX to 32 bits, so there 0x90 is explicitly a nop instruction and not also an xchg.

For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination, where xchg can't on current Intel CPUs. xchg is 3 uops on Intel, all of them having 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" for more microarchitectural details about how current CPUs implement it.

On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2 uop instruction, but with 1c latency each way.

xor-swaps, add/sub swaps, or any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2- and 3-cycle latencies for the two results, and larger code size. The only alternative worth considering is mov instructions.

Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.
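For comparison, the register-swap candidates discussed above can be sketched like this. The choice of cx as the scratch register is only an assumption for illustration; any free register works.

```nasm
bits 16

; 1) the dedicated instruction: one byte when AX is involved
xchg  ax, bx

; 2) three movs through a scratch register (cx is clobbered)
mov   cx, ax      ; cx = old ax
mov   ax, bx      ; ax = old bx
mov   bx, cx      ; bx = old ax

; 3) xor-swap: no scratch register, but each step depends on the previous one
xor   ax, bx      ; ax = a ^ b
xor   bx, ax      ; bx = b ^ (a ^ b) = a
xor   ax, bx      ; ax = (a ^ b) ^ a = b
```

The xor version is a serial chain of three dependent ALU operations, which is why it loses to both of the other options for registers.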

Swapping a register with memory

Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all but code size does (e.g. in a bootloader), or unless you need it to be atomic and/or a full memory barrier, because it's both.

(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you do lock xchg, so you can use it efficiently. But modern CPUs, even in 16-bit mode, do treat xchg mem, reg the same as lock xchg.)
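When atomicity is the point, xchg with memory is the right tool. A minimal spinlock sketch follows; lock_var, acquire and release are hypothetical names, and lock_var is assumed to be a byte in memory holding 0 when free and 1 when held. It relies on the implicit lock described above, so on the older CPUs mentioned in the fun fact you would write lock xchg explicitly.

```nasm
; Hypothetical spinlock: lock_var is an assumed byte, 0 = free, 1 = held.
acquire:
    mov   al, 1
.spin:
    xchg  al, [lock_var]       ; atomically store 1 and fetch the old value
    test  al, al               ; old value 0 means we just took the lock
    jnz   .spin                ; otherwise it was held: retry (al is still 1)
    ret

release:
    mov   byte [lock_var], 0   ; a plain store releases the lock
    ret
```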

So normally the most efficient thing to do is use another register:

   ; emulate  xchg [mem], cx  efficiently for modern x86
   movzx  eax, word [mem]
   mov    [mem], cx
   mov    cx, ax

If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem]), or first spilling the register to a 2nd scratch memory location before loading+storing the memory operand.

The lowest latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or only needs to be reloaded (not saved in the first place, because the value's already in memory or can be recalculated from other registers with an ALU instruction).

; spill/reload another register
push  edx            ; save/restore on the stack or anywhere else

movzx edx, word [mem]    ; or just mov dx, [mem]
mov   [mem], ax
mov   eax, edx

pop   edx            ; or better, just clobber a scratch reg

Two other reasonable (but much worse) options for swapping memory with a register are:

not touching any other registers (except SP):

  ; using scratch space on the stack
  push word [mem]      ; [mem] can be any addressing mode, e.g. [bx]
  mov  [mem], ax
  pop  ax              ; dep chain = load, store, reload.

or not touching anything else:

  ; using no extra space anywhere
  xor  ax, [mem]
  xor  [mem], ax        ; read-modify-write has store-forwarding + ALU latency
  xor  ax, [mem]        ; dep chain = load+xor, (parallel load)+xor+store, reload+xor

Using two memory-destination xor and one memory source would be worse throughput (more stores, and a longer dependency chain).

The push/pop version only works for operand sizes that can be pushed/popped, but xor-swap works for any operand size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code size and speed.
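For instance, a byte operand cannot be pushed in 16-bit code, so xor-swap is what remains when no scratch register is free. A sketch, with mem standing in for any byte-sized memory location:

```nasm
bits 16

; swap al with the byte at [mem] without touching anything else
xor   al, [mem]    ; al  = a ^ m
xor   [mem], al    ; mem = m ^ (a ^ m) = a
xor   al, [mem]    ; al  = (a ^ m) ^ a = m
```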
