Use the x86-64 System V calling convention for your helper functions that you want to piece together via function pointers. In that calling convention, all of xmm/ymm0..15 and zmm0..31 are call-clobbered so even helper functions that need more than 5 vector registers don't have to save/restore any.
The outer interpreter function that calls them should still use Windows x64 fastcall or vectorcall, so from the outside it fully respects that calling convention.
This will hoist all the save/restore of XMM6..15 into that caller, instead of each helper function. This reduces static code size and amortizes the runtime cost over multiple calls through your function pointers.
AFAIK, MSVC doesn't support marking functions as using the x86-64 System V calling convention, only fastcall vs. vectorcall, so you'll have to use clang.
(ICC is buggy and fails to save/restore XMM6..15 around a call to a System V ABI function).
Windows GCC is buggy with 32-byte stack alignment for spilling __m256
, so it's not in general safe to use GCC with -march=
with anything that includes AVX.
Use __attribute__((sysv_abi))
or __attribute__((ms_abi))
on function and function-pointer declarations.
I think ms_abi
is __fastcall
, not __vectorcall
. Clang may support __attribute__((vectorcall))
as well, but I haven't tried it. Google results are mostly feature requests/discussion.
void (*helpers[10])(float *, float*) __attribute__((sysv_abi));
__attribute__((ms_abi))
void outer(float *p) {
helpers[0](p, p+10);
helpers[1](p, p+10);
helpers[2](p+20, p+30);
}
compiles as follows on Godbolt with clang 8.0 -O3 -march=skylake
. (gcc/clang on Godbolt target Linux, but I used explicit ms_abi
and sysv_abi
on both the function and function-pointers so the code gen doesn't depend on the fact that the default is sysv_abi
. Obviously you'd want to build your function with a Windows gcc or clang so calls to other functions would use the right calling convention. And a useful object-file format, etc.)
Notice that gcc/clang emit code for outer()
that expects the incoming pointer arg in RCX (Windows x64), but pass it to the callees in RDI and RSI (x86-64 System V).
outer: # @outer
push r14
push rsi
push rdi
push rbx
sub rsp, 168
vmovaps xmmword ptr [rsp + 144], xmm15 # 16-byte Spill
vmovaps xmmword ptr [rsp + 128], xmm14 # 16-byte Spill
vmovaps xmmword ptr [rsp + 112], xmm13 # 16-byte Spill
vmovaps xmmword ptr [rsp + 96], xmm12 # 16-byte Spill
vmovaps xmmword ptr [rsp + 80], xmm11 # 16-byte Spill
vmovaps xmmword ptr [rsp + 64], xmm10 # 16-byte Spill
vmovaps xmmword ptr [rsp + 48], xmm9 # 16-byte Spill
vmovaps xmmword ptr [rsp + 32], xmm8 # 16-byte Spill
vmovaps xmmword ptr [rsp + 16], xmm7 # 16-byte Spill
vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
mov rbx, rcx # save p
lea r14, [rcx + 40]
mov rdi, rcx
mov rsi, r14
call qword ptr [rip + helpers]
mov rdi, rbx
mov rsi, r14
call qword ptr [rip + helpers+8]
lea rdi, [rbx + 80]
lea rsi, [rbx + 120]
call qword ptr [rip + helpers+16]
vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
vmovaps xmm7, xmmword ptr [rsp + 16] # 16-byte Reload
vmovaps xmm8, xmmword ptr [rsp + 32] # 16-byte Reload
vmovaps xmm9, xmmword ptr [rsp + 48] # 16-byte Reload
vmovaps xmm10, xmmword ptr [rsp + 64] # 16-byte Reload
vmovaps xmm11, xmmword ptr [rsp + 80] # 16-byte Reload
vmovaps xmm12, xmmword ptr [rsp + 96] # 16-byte Reload
vmovaps xmm13, xmmword ptr [rsp + 112] # 16-byte Reload
vmovaps xmm14, xmmword ptr [rsp + 128] # 16-byte Reload
vmovaps xmm15, xmmword ptr [rsp + 144] # 16-byte Reload
add rsp, 168
pop rbx
pop rdi
pop rsi
pop r14
ret
GCC makes basically the same code. But Windows GCC is buggy with AVX.
ICC19 makes similar code, but without the save/restore of xmm6..15. This is a showstopper bug; if any of the callees do clobber those regs like they're allowed to, then returning from this function will violate its calling convention.
This leaves clang as the only compiler that you can use. That's fine; clang is very good.
If your callees don't need all the YMM registers, saving/restoring all of them in the outer function is overkill. But there's no middle ground with existing toolchains; you'd have to hand-write outer
in asm to take advantage of knowing that none of your possible callees ever clobber XMM15 for example.
Note that calling other MS-ABI functions from inside outer()
is totally fine. GCC / clang will (barring bugs) emit correct code for that too, and it's fine if a called function chooses not to destroy xmm6..15.