I have been studying measuring memory bandwidth for Intel processors with various operations and one of them is memcpy
. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).
To write your own efficient memcpy
function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy
function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.
For small arrays the maximum read and write bandwidth for each core is:
Core2-Ivy Bridge 32 bytes/cycle
Haswell 64 bytes/cycle
This is the benchmark you should aim for small arrays. For my tests I assume the arrays are aligned to 64-bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor
. Here are my current memcpy
results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):
GB/s efficiency
Core2 (p9600@2.66 GHz)
builtin 35.2 41.3%
eglibc 39.2 46.0%
asmlib: 76.0 89.3%
copy_unroll1: 39.1 46.0%
copy_unroll8: 73.6 86.5%
Ivy Bridge (E5-1620@3.6 GHz)
builtin 102.2 88.7%
eglibc: 107.0 92.9%
asmlib: 107.6 93.4%
copy_unroll1: 106.9 92.8%
copy_unroll8: 111.3 96.6%
Haswell (i5-4250U@1.3 GHz)
builtin: 68.4 82.2%
eglibc: 39.7 47.7%
asmlib: 73.2 87.6%
copy_unroll1: 39.6 47.6%
copy_unroll8: 81.9 98.4%
The asmlib
is Agner Fog's asmlib. The copy_unroll1
and copy_unroll8
functions are defined below.
From this table we can see that the GCC builtin memcpy
does not work well on Core2 and that memcpy
in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.
void copy_unroll1(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i++) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
}
}
void copy_unroll8(const float *x, float *y, const int n) {
for(int i=0; i<n/JUMP; i+=8) {
VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
}
}
Where VECNF().LOAD
is _mm_load_ps()
for SSE or _mm256_load_ps()
for AVX, VECNF().STORE
is _mm_store_ps()
for SSE or _mm256_store_ps()
for AVX, and JUMP is 4 for SSE or 8 for AVX.
For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe a single thread does NOT usually saturate the memory bandwidth.
void copy_stream(const float *x, float *y, const int n) {
#pragma omp parallel for
for(int i=0; i<n/JUMP; i++) {
VECNF v = VECNF().load_a(&x[JUMP*i]);
stream(&y[JUMP*i], v);
}
}
Where stream
is _mm_stream_ps()
for SSE or _mm256_stream_ps()
for AVX
Here are the memcpy
results on my E5-1620@3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.
GB/s efficiency
eglibc: 23.6 46%
asmlib: 36.7 72%
copy_stream: 36.7 72%
Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.
I modfied the eglibc
and asmlib
memcpy
functions to run in parallel like this
void COPY(const float * __restrict x, float * __restrict y, const int n) {
#pragma omp parallel
{
size_t my_start, my_size;
int id = omp_get_thread_num();
int num = omp_get_num_threads();
my_start = (id*n)/num;
my_size = ((id+1)*n)/num - my_start;
memcpy(y+my_start, x+my_start, sizeof(float)*my_size);
}
}
A general memcpy
function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But theses are "second order" details which I think should be dealt with after optimizing for ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
Update
Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb
than movntdqa
(a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization manuals in the section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB).
Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:
"There are several ways of moving large blocks of data. The most common methods are:
- REP MOVS instruction.
- If data are aligned: Read and write in a loop with the largest available register size.
- If size is constant: inline move instructions.
- If data are misaligned: First move as many bytes as required to make the destination
aligned. Then read unaligned and write aligned in a loop with the largest available
register size.
- If data are misaligned: Read aligned, shift to compensate for misalignment and write
aligned.
- If the data size is too big for caching, use non-temporal writes to bypass the cache.
Shift to compensate for misalignment, if necessary."
A general memcpy
should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpy
function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy
or the one in EGLIBC so the assumption that you can't do better than the standard libraries is incorrect.