The x86-64 SysV ABI specifies, among other things, how function parameters are passed in registers (first argument in rdi, then rsi and so on), and how integer return values are passed back (in rax and then rdx for really big values).
What I can't find, however, is what the high bits of parameter or return value registers should be when passing types smaller than 64-bits.
For example, for the following function:
void foo(unsigned x, unsigned y);
... x will be passed in rdi and y in rsi, but they are only 32 bits wide. Do the high 32 bits of rdi and rsi need to be zero? Intuitively, I would assume yes, but the code generated by all of gcc, clang and icc has specific mov instructions at the start to zero out the high bits, so it seems like the compilers assume otherwise.
Similarly, the compilers seem to assume that the high bits of the return value register rax may contain garbage if the return value is smaller than 64 bits. For example, the loops in the following code:
unsigned gives32();
unsigned short gives16();
long sum32_64() {
    long total = 0;
    for (int i=1000; i--; ) {
        total += gives32();
    }
    return total;
}
long sum16_64() {
    long total = 0;
    for (int i=1000; i--; ) {
        total += gives16();
    }
    return total;
}
... compile to the following in clang (and other compilers are similar):
sum32_64():
    ...
.LBB0_1:
    call gives32()
    mov eax, eax
    add rbx, rax
    inc ebp
    jne .LBB0_1
sum16_64():
    ...
.LBB1_1:
    call gives16()
    movzx eax, ax
    add rbx, rax
    inc ebp
    jne .LBB1_1
Note the mov eax, eax after the call returning 32 bits, and the movzx eax, ax after the 16-bit call - both have the effect of zeroing out the top 32 or 48 bits, respectively. So this behavior has some cost - the same loop dealing with a 64-bit return value omits this instruction.
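To make the effect of those two instructions concrete, here is a small C model. It is only a sketch: a uint64_t stands in for rax, the helper names (zext32, zext16, accumulate32) are made up for illustration, and nothing here is taken from the ABI document.

```c
#include <stdint.h>

/* Model of `mov eax, eax`: a write to a 32-bit register
   clears bits 63..32 of the full 64-bit register. */
uint64_t zext32(uint64_t rax) {
    return (uint64_t)(uint32_t)rax;
}

/* Model of `movzx eax, ax`: keep bits 15..0, clear bits 63..16. */
uint64_t zext16(uint64_t rax) {
    return (uint64_t)(uint16_t)rax;
}

/* One iteration of the sum32_64 loop body, assuming the callee
   may have left junk in the upper half of rax (hypothetical helper). */
uint64_t accumulate32(uint64_t total, uint64_t dirty_rax) {
    return total + zext32(dirty_rax);
}
```

With a value such as 0xDEADBEEF00000007 in the model register, zext32 yields 7. Adding the register without the extension would fold the junk into the 64-bit total, which is presumably why the compilers emit the extra instruction.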
I've read the x86-64 System V ABI document pretty carefully, but I couldn't find whether this behavior is documented in the standard.
What are the benefits of such a decision? It seems to me there are clear costs:
Parameter Costs
Costs are imposed on the implementation of the callee when it deals with parameter values. Granted, often this cost is zero because the function can effectively ignore the high bits, or the zeroing comes for free since 32-bit operand size instructions can be used, which implicitly zero the high bits.
However, costs are often very real in the cases of functions that accept 32-bit arguments and do some math that could benefit from 64-bit math. Take this function for example:
#include <stdint.h>
uint32_t average(uint32_t a, uint32_t b) {
    return ((uint64_t)a + b) >> 1;
}
This is a straightforward use of 64-bit math to compute a function that would otherwise have to deal carefully with overflow (the ability to transform many 32-bit functions in this way is an often unnoticed benefit of 64-bit architectures). It compiles to:
average(unsigned int, unsigned int):
    mov edi, edi
    mov eax, esi
    add rax, rdi
    shr rax, 1
    ret
Fully 2 out of the 4 instructions (ignoring ret) are needed just to zero out the high bits. This may be cheap in practice with mov-elimination, but it still seems a big cost to pay.
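For reference, the overflow issue that the 64-bit version sidesteps can be sketched as follows. This assumes the goal is a true halving average (shift by one). The function names are made up for illustration, and the 32-bit-only workaround shown is just one well-known bit trick, not something the compilers generate here.

```c
#include <stdint.h>

/* Naive 32-bit version: a + b wraps modulo 2^32 before the shift,
   so the carry out of bit 31 is lost. */
uint32_t average_naive32(uint32_t a, uint32_t b) {
    return (a + b) >> 1;
}

/* Overflow-safe using only 32-bit operations: the bits both operands
   share, plus half of the bits in which they differ. */
uint32_t average_careful32(uint32_t a, uint32_t b) {
    return (a & b) + ((a ^ b) >> 1);
}

/* Widened version: the 33-bit sum always fits in 64 bits. */
uint32_t average_wide64(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a + b) >> 1);
}
```

For a = b = 0xFFFFFFFF the naive version returns 0x7FFFFFFF, while the other two return 0xFFFFFFFF. The widened version is the easiest to get right, but it is exactly the kind of code that needs the inputs zero-extended to 64 bits first.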
On the other hand, I can't really see a corresponding cost for callers if the ABI were to specify that the high bits are zero. Because rdi and rsi and the other parameter-passing registers are scratch (i.e., can be overwritten by the callee), there are only a couple of scenarios (we look at rdi, but replace it with the parameter register of your choice):
- The value passed to the function in rdi is dead (not needed) in the post-call code. In that case, whatever instruction last assigned to rdi simply has to assign to edi instead. Not only is this free, it is often one byte smaller if you avoid a REX prefix.
- The value passed to the function in rdi is needed after the function. In that case, since rdi is caller-saved, the caller needs to do a mov of the value to a callee-saved register anyway. You can generally organize it so that the value starts in the callee-saved register (say rbx) and then is moved to edi with something like mov edi, ebx, so it costs nothing.
I can't see many scenarios where the zeroing costs the caller much. One example would be when 64-bit math is needed in the last instruction that assigned rdi. That seems quite rare though.
Return value costs
Here the decision seems more neutral. Having callees clear out the junk has a definite cost (you sometimes see mov eax, eax instructions to do this), but if garbage is allowed the cost shifts to the caller. Overall, it seems more likely that the caller can clear the junk for free, so allowing garbage doesn't seem detrimental to performance overall.
I suppose one interesting use-case for this behavior is that functions with varying sizes can share an identical implementation. For example, all of the following functions:
short sums(short x, short y) {
    return x + y;
}
int sumi(int x, int y) {
    return x + y;
}
long suml(long x, long y) {
    return x + y;
}
Can actually share the same implementation [1]:
sum:
    lea rax, [rdi+rsi]
    ret
[1] Whether such folding is actually allowed for functions that have their address taken is very much open to debate.
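The reason a single 64-bit add can serve all three widths is that the low N bits of a sum depend only on the low N bits of the operands. Here is a rough C model of that. It is a sketch under my own assumptions: uint64_t values stand in for the registers, the helper names and junk patterns are arbitrary, and unsigned arithmetic is used so that wraparound is well defined.

```c
#include <stdint.h>

/* Model of `lea rax, [rdi+rsi]`: a full 64-bit add of whatever
   happens to be in the two registers, junk included. */
uint64_t shared_sum(uint64_t rdi, uint64_t rsi) {
    return rdi + rsi;
}

/* Put a 16-bit value in a register with arbitrary junk above bit 15. */
uint64_t with_junk16(uint16_t v) {
    return 0xABCDEF0123450000ULL | v;
}

/* Put a 32-bit value in a register with arbitrary junk above bit 31. */
uint64_t with_junk32(uint32_t v) {
    return 0xABCDEF0100000000ULL | v;
}
```

A caller that wants the 16-bit result reads only the low 16 bits, for example (uint16_t)shared_sum(with_junk16(40000), with_junk16(30000)), and gets the same value as a dedicated 16-bit add. The same holds for the low 32 bits. The junk only ever affects the bits that the caller is already required to ignore.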