8086 has an instruction for this:
xchg ax, bx
If you really need to swap two regs, xchg ax, bx is the most efficient way on all x86 CPUs in most cases, modern and ancient, including 8086. (You could construct a case where multiple single-uop instructions are more efficient because of some other weird front-end effect due to surrounding code. Or, for 32-bit operand size, a case where zero-latency mov makes a 3-mov sequence with a temporary register better on Intel CPUs.)
For code-size, xchg-with-ax only takes a single byte. This is where the 0x90 NOP encoding comes from: it's xchg ax, ax, or xchg eax, eax in 32-bit mode (see footnote 1). Exchanging any other pair of registers takes 2 bytes for the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
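As a rough illustration of the size difference described above, here is a small sketch in NASM syntax, assuming 16-bit mode. The byte values in the comments follow from the short-form opcode being 90h plus the register number; the register choices are only examples.

```nasm
bits 16

xchg ax, bx     ; 1 byte:  93h  (90h + 3, the register number of bx)
xchg ax, ax     ; 1 byte:  90h  (the same byte as nop)
xchg cx, dx     ; 2 bytes: opcode 87h followed by a ModRM byte
```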
On an actual 8086, code-fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-ax short form.
Footnote 1: In 64-bit mode, xchg eax, eax would truncate RAX to 32 bits (writing a 32-bit register zeroes the upper half of the full register), so 0x90 is explicitly a nop instruction, not also an xchg.
For 32-bit / 64-bit registers, 3 mov instructions with a temporary could benefit from mov-elimination where xchg can't on current Intel CPUs. xchg is 3 uops on Intel, all of them having 1c latency and needing an execution unit, so one direction has 2c latency but the other has 1c latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" for more microarchitectural details about how current CPUs implement it.
On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it's like two mov instructions that run in parallel. On earlier AMD CPUs, it's still a 2 uop instruction, but with 1c latency each way.
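For reference, the 3-mov alternative to xchg mentioned above could look like the following sketch. It assumes ecx is free to use as a scratch register (that choice is arbitrary, and ecx is clobbered). Whether this beats xchg depends on the microarchitecture and on whether a spare register is actually available.

```nasm
; swap eax and ebx through a temporary; ecx is destroyed
mov ecx, eax    ; ecx = old eax
mov eax, ebx    ; eax = old ebx
mov ebx, ecx    ; ebx = old eax
```

This costs more code bytes than xchg eax, ebx, so it only makes sense where the register-to-register mov instructions are cheap enough to pay for that.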
xor-swaps or add/sub swaps or any other multi-instruction sequence other than mov are pointless compared to xchg for registers. They all have 2 and 3 cycle latency, and larger code-size. The only thing that's worth considering is mov instructions.
Or better, unroll a loop or rearrange your code to not need a swap, or to only need a mov.
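To make the "rearrange your code" suggestion concrete, here is a hedged sketch using a Fibonacci-style loop, which is not from the original question but is a common case where a swap appears. The labels fib_swap and fib_unrolled are hypothetical, and both routines assume cx holds a nonzero iteration count on entry.

```nasm
; Version with a swap on every iteration.
; Invariant at the top of the loop: ax = F(n), bx = F(n+1).
fib_swap:
    mov  ax, 0
    mov  bx, 1
.loop:
    add  ax, bx         ; ax = F(n) + F(n+1) = F(n+2)
    xchg ax, bx         ; restore the ordering: ax = F(n+1), bx = F(n+2)
    dec  cx
    jnz  .loop
    ret                 ; ax = F(count)

; Unrolled by two: the registers trade roles instead of being swapped.
fib_unrolled:
    mov  ax, 0
    mov  bx, 1
.loop:
    add  ax, bx         ; ax = F(n+2), bx still F(n+1)
    add  bx, ax         ; bx = F(n+3), ax still F(n+2)
    dec  cx
    jnz  .loop
    ret                 ; ax = F(2 * count): each iteration advances two steps
```

The unrolled version does two steps per iteration with no swap at all, at the cost of only being able to stop on even step counts without extra cleanup code.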
Swapping a register with memory
Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all, but code-size does (e.g. in a bootloader). Or if you need it to be atomic and/or a full memory barrier, because it's both.
(Fun fact: the implicit lock behaviour was new in 386. On 8086 through 286, xchg with mem isn't special unless you do lock xchg, so you can use it efficiently. But modern CPUs, even in 16-bit mode, do treat xchg mem, reg the same as lock xchg.)
So normally the most efficient thing to do is use another register:
; emulate xchg [mem], cx efficiently for modern x86
movzx eax, word [mem]
mov [mem], cx
mov cx, ax
If you need to exchange a register with memory and don't have a free scratch register, xor-swap could in some cases be the best option. Using temp memory would require copying the memory value (e.g. to the stack with push [mem]), or first spilling the register to a 2nd scratch memory location before loading+storing the memory operand.
The lowest latency way by far is still with a scratch register; often you can pick one that isn't on the critical path, or only needs to be reloaded (not saved in the first place, because the value's already in memory or can be recalculated from other registers with an ALU instruction).
; spill/reload another register
push edx ; save/restore on the stack or anywhere else
movzx edx, word [mem] ; or just mov dx, [mem]
mov [mem], ax
mov eax, edx
pop edx ; or better, just clobber a scratch reg
Two other reasonable (but much worse) options for swapping memory with a register are:
not touching any other registers (except SP):
; using scratch space on the stack
push [mem] ; [mem] can be any addressing mode, e.g. [bx]
mov [mem], ax
pop ax ; dep chain = load, store, reload.
or not touching anything else:
; using no extra space anywhere
xor ax, [mem]
xor [mem], ax ; read-modify-write has store-forwarding + ALU latency
xor ax, [mem] ; dep chain = load+xor, (parallel load)+xor+store, reload+xor
Using two memory-destination xor and one memory source would be worse throughput (more stores, and a longer dependency chain).
The push/pop version only works for operand-sizes that can be pushed/popped, but xor-swap works for any operand-size. If you can use a temporary on the stack, the save/restore version is probably preferable, unless you need a balance of code-size and speed.