I'm using a machine with 8 cores and 32 GB of RAM. On this machine I'm running C++ code, built with VS2010 on Windows x64, which takes 3 days to complete 8 trees (8 is the number of outer threads). I searched for the bottleneck and found that the crossCorrelate
method takes around 75-80% of the time. Now I'm trying to make that method more efficient. The code is as follows:
int main(){
    int numThread = 8;
    // create the threads, run the build_tree method on each of them,
    // and join once all of them have finished
}
// I'm creating 8 trees
void build_tree(int i){ // called millions of times
    for(some_value to another_val){
        // do some stuff
        // read the corresponding matrices (mat1, mat2)
        crossCorrelate(mat1, mat2);
    }
    // write the results to a file
}
// each tree works on its own data; there is no dependency between trees.
Mat crossCorrelate(Mat mat1_real, Mat mat2_real){
    clock_t s1 = clock(); // start time for the measurement printed at the end
    Mat mat1, mat2, result;
    // 1st multi-threaded part: around 20 ms
    Scalar mean1 = mean(mat1_real);
    subtract(mat1_real, (float)mean1[0], mat1);
    Scalar mean2 = mean(mat2_real);
    subtract(mat2_real, (float)mean2[0], mat2);
    // 1st part ends
    Mat tilted_mat2 = flip_cross(mat2);
    // real plane plus a zero-filled imaginary plane for each input
    Mat planes[] = {Mat_<float>(mat1), Mat::zeros(mat1.size(), CV_32F)};
    Mat planes2[] = {Mat_<float>(tilted_mat2), Mat::zeros(mat1.size(), CV_32F)};
    Mat complexI;
    // 2nd multi-threaded part: around 150 ms
    merge(planes, 2, complexI);
    dft(complexI, complexI);
    split(complexI, planes);
    merge(planes2, 2, complexI);
    dft(complexI, complexI);
    split(complexI, planes2);
    // 2nd multi-threaded part ends
    // do some operations with mat1, mat2, planes etc.
    clock_t s11 = clock();
    cout << "total time diff " << s11 - s1 << endl;
    return result;
}
This is the method that I want to make more efficient. It takes around 600 ms per call. My idea was to make some independent parts of the method multi-threaded, and I found two places that can run in parallel.
To that end, I wrote a simple function for each of them (the 1st and 2nd multi-threaded parts) and ran those functions on separate threads:
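// boost::thread copies its arguments, so the matrices are wrapped in
// boost::cref/boost::ref to let the result reach the caller's mat1
t1 = boost::thread(subtract_mean, boost::cref(mat1_real), boost::ref(mat1));
void subtract_mean(const Mat& mat_ori, Mat& mat){
    Scalar mean1 = mean(mat_ori);
    subtract(mat_ori, (float)mean1[0], mat);
}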
Similarly, for the 2nd part I create two threads, one for each dft (dft_thread).
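A minimal, self-contained sketch of the sub_thread idea, for illustration only. It makes two substitutions that are not in the original code: `std::vector<float>` stands in for `cv::Mat`, and C++11 `std::thread` stands in for `boost::thread` (VS2010 does not ship `std::thread`). The name `subtract_means_parallel` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Stand-in for mean() + subtract(): writes src minus its mean into dst.
void subtract_mean(const std::vector<float>& src, std::vector<float>& dst) {
    assert(!src.empty());
    const float mean = std::accumulate(src.begin(), src.end(), 0.0f) / src.size();
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i] - mean;
}

// Runs the two independent mean subtractions on two inner threads.
// Both threads are created and joined on every call, so that cost is
// paid each time the function runs.
void subtract_means_parallel(const std::vector<float>& a,
                             const std::vector<float>& b,
                             std::vector<float>& out_a,
                             std::vector<float>& out_b) {
    // std::thread copies its arguments, so references must be wrapped in
    // std::cref/std::ref to reach the caller's vectors.
    std::thread t1(subtract_mean, std::cref(a), std::ref(out_a));
    std::thread t2(subtract_mean, std::cref(b), std::ref(out_b));
    t1.join();
    t2.join();
}
```

With 8 outer threads already running on 8 cores, each call like this adds two more threads that compete for the same cores, on top of the creation and join cost per call.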
The code does a lot of computation, so when I run it CPU usage is around 90%.
I was expecting a better result with the inner threads, but it is not better.
Here are my questions: Why does my code run faster without dft_thread and sub_thread? How can I make crossCorrelate faster? Could I create an inner thread once and reuse it over and over, and would that make my code faster? Is there a clever way of adding inner threads to my code?
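To make the "create an inner thread once and reuse it" idea concrete, here is a minimal sketch of one long-lived worker that runs submitted tasks in order. It assumes C++11 (`std::thread`, `std::mutex`, `std::condition_variable`) rather than the boost equivalents available in VS2010, and the class name `ReusableWorker` and its methods are hypothetical, not part of the original code.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: one long-lived thread that runs submitted tasks
// in order, so no thread is created per call.
class ReusableWorker {
public:
    ReusableWorker()
        : stop_(false), pending_(0), thread_(&ReusableWorker::run, this) {}

    ~ReusableWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Queue a task for the worker thread.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        cv_.notify_one();
    }

    // Block until every submitted task has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        done_cv_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested, queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            done_cv_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;       // signals new work or shutdown
    std::condition_variable done_cv_;  // signals that a task finished
    std::queue<std::function<void()>> tasks_;
    bool stop_;
    int pending_;
    std::thread thread_;  // declared last so the other members exist first
};
```

In this scheme each outer thread would own one worker, submit one dft to it, run the other dft itself, and then call `wait()`. This only removes the per-call thread creation cost; whether it is faster than the single-threaded version when all 8 cores are already busy is not something the sketch establishes.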
EDIT: I did some new tests.
With no inner threads, I checked what happens when the number of outer threads is 1, 2, 4, 6 and 8, for tree size = 16. Here are the results:
numThread            1      2      4      6      8
total time (sec)     29     35     51     77     104
avg_time (sec)       29     17.5   12.7   12.8   13
I think this shows that I can only make it about 2.5 times faster with threads. I was expecting it to be 5-6 times faster with 8 threads. Is this what it should have been? Am I doing something wrong, or is my understanding of threads wrong?
EDIT2: I did one more test.
First: running the code with 6 threads.
Second: copying the Visual Studio project 5 times and running 6 processes at the same time, each of them with one thread (multithreading vs. parallel processing).
Multithreading takes 141 mins, whereas parallel processing takes 70 mins.
Note that running one process with one thread takes 53 mins.
What could be the reason for that? Has anybody seen such an abnormal situation? I thought both should run at about the same speed (with multithreading maybe a bit faster), since they use the same amount of resources. Am I wrong?
Thanks,