
c - Slow sparse matrix-vector product (CSR) using OpenMP

I am trying to speed up a sparse matrix-vector product using OpenMP; the code is as follows:

/* z = A*x for an M-row CSR matrix A: non-zero values in data, column
   indices in colind, row offsets in row_ptr; x is the input vector. */
void zAx(double * z, double * data, long * colind, long * row_ptr, double * x, int M){

long i, j, ckey;
int chunk = 1000;
//int * counts[8]={0};
#pragma omp parallel num_threads(8)
{ 
  #pragma omp for private(ckey,j,i) schedule(static,chunk)
  for (i=0; i<M; i++ ){ 
    z[i]=0;
    for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
      j = colind[ckey];
      z[i] += data[ckey]*x[j];
    }              
  }
}
}

Now, this code runs fine and produces the correct result; however, it only gives me a speed-up of ~30%. I have checked that the threads are all getting about the same number of non-zero elements (they are), and the matrix is fairly large (300,000 x 300,000), so I'm hoping the overhead isn't the only issue. I've also tried running with different chunk sizes and thread counts, but I get similar performance.

Is there something else I could try to get a bit of extra speed out of this? Or something I'm obviously doing wrong?

Cheers.

Edit: Just commented out int * counts[8]={0}, because it was left over from counting the work allocation. It's not needed.

Edit2 (more details):

Okay, so I timed a loop that calls this 5000 times and got these average times:

  • seq: 0.0036 (seconds?)
  • 2 threads: 0.002613
  • 4 threads: 0.002308
  • 8 threads: 0.002384

The matrix is of size 303544x303544 and has 2122980 non-zero elements.

With a much smaller matrix (30000x30000) I get times more like:

  • seq 0.000303
  • 8 threads 0.000078

So it seems the large size may be my issue.
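
For reference, a per-call average like the ones above can be obtained by wrapping the kernel in a simple timing loop; the sketch below is only illustrative, assumes the CSR arrays and x are already filled in, and uses omp_get_wtime() for the clock (the name time_zAx and the NITER constant are made up for the example):

#include <omp.h>

#define NITER 5000   /* number of repetitions to average over */

/* zAx is the routine shown above, defined elsewhere. */
void zAx(double *z, double *data, long *colind, long *row_ptr, double *x, int M);

/* Returns the average wall-clock time of one zAx call, in seconds. */
double time_zAx(double *z, double *data, long *colind, long *row_ptr,
                double *x, int M)
{
    double t0 = omp_get_wtime();
    for (int it = 0; it < NITER; it++)
        zAx(z, data, colind, row_ptr, x, M);
    return (omp_get_wtime() - t0) / NITER;
}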


1 Reply


Welcome to the wonderful world of memory-bound problems. To relieve you of your pain, I would like to inform you that sparse matrix-vector multiplication is one of the many things that cannot be effectively parallelised, or even vectorised, on a single multi-core chip, unless all of the data fits in the last-level cache or the memory bus is really, really wide.

Why? Simply because the ratio of computation to memory access is extremely low. For each iteration of the inner loop you fetch the column index into j (8 bytes), the matrix element from data (8 bytes), the value of the vector element (8 bytes) and the previous value of the result (8 bytes, since compilers rarely optimise access to shared variables). Then you perform 2 very fast floating-point operations (FLOPs) and perform a store (although the += operator is translated to a single instruction, it is still a "read-modify-write" one). In total you load 32 bytes and perform 2 FLOPs on them. That makes 1/16 FLOPs per byte.

A modern SSE-capable CPU core can perform 4 double-precision FLOPs/cycle, which often results in something like 8 GFLOPS per CPU core (assuming a base frequency of 2 GHz). With AVX the number doubles, so you get up to 16 GFLOPS per core on a 2 GHz Intel Sandy/Ivy Bridge or an AMD equivalent. In order to saturate this processing power with data, given 1/16 FLOPs/byte, you would need at least 128 GiB/s of memory bandwidth.

A high-end Nehalem-EX processor like the Xeon X7560 runs at 2.26 GHz (9.04 GFLOPS/core) and its shared L3 cache (the L1 and L2 caches are per-core) delivers about 275 GiB/s. At 9.04 GFLOPS/core, you need 144.64 GiB/s per core in order to feed the inner loop of your zAx routine. This means that in the ideal case, the L3 cache of this CPU cannot feed more than two fully vectorised multiplication kernels.
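
To make the back-of-the-envelope arithmetic explicit, here is a tiny, purely illustrative C program that reproduces these numbers (the 8 and 9.04 GFLOPS per-core figures are the ones quoted above):

#include <stdio.h>

int main(void)
{
    /* Per inner-loop iteration: colind[ckey], data[ckey], x[j], z[i] */
    double bytes_per_iter = 8.0 + 8.0 + 8.0 + 8.0;       /* = 32 bytes        */
    double flops_per_iter = 2.0;                          /* multiply + add    */
    double intensity = flops_per_iter / bytes_per_iter;   /* = 1/16 FLOPs/byte */

    /* Bandwidth needed to keep one core busy = FLOP rate / intensity */
    printf("2 GHz SSE core  (8.00 GFLOPS): %6.2f GiB/s\n", 8.00 / intensity);  /* 128.00 */
    printf("Nehalem-EX core (9.04 GFLOPS): %6.2f GiB/s\n", 9.04 / intensity);  /* 144.64 */
    return 0;
}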

Without SSE vectorisation the double-precision FLOP rate is half of that, so one could expect the problem to scale up to about 4 threads. Things get extremely bad once your problem becomes larger than the L3 cache, as the memory bus delivers about ten times less bandwidth.

Try the following version of the inner loop just to see if the compiler is smart enough to follow the relaxed memory view of OpenMP:

#pragma omp for private(ckey,j) schedule(static,chunk)
for (i=0; i<M; i++){
  double zi = 0.0;  /* accumulate the row sum in a local variable...          */
  for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
    j = colind[ckey];
    zi += data[ckey]*x[j];
  }
  z[i] = zi;        /* ...and store to the shared array only once per row     */
}

Unfortunately there is nothing more that you can do. Sparse matrix-vector multiplication scales with the number of CPU sockets, not with the number of CPU cores. You would need a multi-socket system with separate memory controllers, e.g. any system with more than one (post-)Nehalem or AMD64 processor.

Edit: optimisation hint. Do you really need long to store the column index and the row pointers? With 2122980 non-zero elements you would be pretty fine using int instead. It would save loading 4 bytes per element in the inner loop and another 4 bytes per row in the outer loop.
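
A sketch of what that could look like, also folding in the per-row accumulator from above (untested, and it assumes the caller can supply 32-bit index arrays; zAx_int is just an illustrative name):

/* CSR matrix-vector product with 32-bit indices: saves 4 bytes of index
   traffic per non-zero and 4 bytes per row compared to the long version.
   2122980 non-zeros and 303544 rows both fit comfortably in an int. */
void zAx_int(double *z, double *data, int *colind, int *row_ptr,
             double *x, int M)
{
    int chunk = 1000;
    #pragma omp parallel for schedule(static, chunk) num_threads(8)
    for (int i = 0; i < M; i++) {
        double zi = 0.0;                        /* per-row accumulator      */
        for (int ckey = row_ptr[i]; ckey < row_ptr[i+1]; ckey++)
            zi += data[ckey] * x[colind[ckey]];
        z[i] = zi;                              /* single store to shared z */
    }
}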

