Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
900 views
in Technique[技术] by (71.8m points)

objective c - iOS Concurrency - Not reaching anywhere's near theoretical maximum

I'm new to Grand Central Dispatch and have been running some tests with it doing some processing on an image. Basically I'm running a grayscale algorithm both sequentially and using GCD and comparing the results.

here is the basic loop:

UInt8 r,g,b;
uint pixelIndex;
for (uint y = 0; y < height; y++) {
    for (uint x = 0; x < width; x++) {
        pixelIndex = (uint)(y * width + x);

        if (pixelIndex+2 < width * height) {
            sourceDataPtr = &sourceData[pixelIndex];

            r = sourceDataPtr[0+0];
            g = sourceDataPtr[0+1];
            b = sourceDataPtr[0+2];

            int value = (r+g+b) / 3;
            if (value > MAX_COLOR_VALUE) {
                value = MAX_COLOR_VALUE;
            }

            targetData[pixelIndex] = value;
            self.imageData[pixelIndex] = value;
        }
    }
}

It simply runs through and takes the average value for Red, Green & Blue and uses that for the gray value. Very Simple. Now the parallel version basiclaly breaks the image into portions and then computes those portions seperately. Namely 2, 4, 8, 16 & 32 portions. I'm using the basic GCD so pass each portion in as it's own block to run concurrently. Here is the GCD wrapped code:

dispatch_group_t myTasks = dispatch_group_create();

for (int startX = 0; startX < width; startX += width/self.numHorizontalSegments) {
    for (int startY = 0; startY < height; startY += height/self.numVerticalSegments) {
        // For each segment, enqueue a block of code to compute it.
        dispatch_group_async(myTasks, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
             // grayscale code...
        });
    }
}
dispatch_group_wait(myTasks, DISPATCH_TIME_FOREVER); 

Everything is working fine. But what I am not understanding is the speedup / CPU usage. Running tests in the simulator (which is using my dual core CPU) I am getting:

  • ~0.0945s run time sequentially
  • ~0.0675s run time using GCD

This is a speedup of around ~28% (a.k.a taking 72% the time of the sequential version). Theoretically, on a 2-core machine 100% speedup is the maximum. So this is falling well short of that and I can't figure out why.

I monitor the CPU usage and it maxes out around 118% - why is it not reaching closer to 200%? If anyone has an idea as to what I should change, or what is the culprit here I would greatly appreciate it.

My Theories:

  • Not enough work on CPU (but image is ~3,150,000 pixels)
  • Not enough time to fire up to near 200%? Maybe each thread requires a longer runtime before it starts chewing up that much of the CPU?
  • I thought maybe the overhead was pretty high, but a test of launching 32 empty blocks to a queue (also in a group) took around ~0.0005s maximum.
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Most likely guess. In the single-threaded case, you are CPU bound. In the multi-threaded case, you are memory bound. In other words, the two cores are reading the data from DRAM at the maximum bus bandwidth. As a result, the cores end up idling waiting for more data to process.

You can test my theory by doing a true luminance calculation:

int value = floor( 0.299 * red + 0.587 * green + 0.114 * blue );

That calculation will yield gray scale values in the range from 0 to 255, given 8-bit rgb values. It also gives the processors more work to do per pixel. If you change that line of code, the time for the single threaded case should increase somewhat. And, if I'm correct, then the multi-threaded case should show a better performance improvement, as a percentage of the single-threaded time.


I decided to run some benchmarks of my own, both on the simulator and on an iPad2. The structure of my code was as follows.

Single Threaded

start = TimeStamp();

for ( y = 0; y < 2048; y++ )
    for ( x = 0; x < 1536; x++ )
        computePixel();

end = TimeStamp();
NSLog( @"single      = %8.3lf msec", (end - start) * 1e3 );

Two Threads using GCD

dispatch_group_t tasks = dispatch_group_create();
dispatch_queue_t queue = dispatch_get_global_queue( DISPATCH_QUEUE_PRIORITY_HIGH, 0 );

start = TimeStamp();
dispatch_group_async( tasks, queue, 
^{
    topStart = TimeStamp();

    for ( y = 0; y < 1024; y++ )
        for ( x = 0; x < 1536; x++ )
            computePixel();

    topEnd = TimeStamp();
});

dispatch_group_async( tasks, queue, 
^{
    bottomStart = TimeStamp();

    for ( y = 1024; y < 2048; y++ )
        for ( x = 0; x < 1536; x++ )
            computePixel();

    bottomEnd = TimeStamp();
});

wait = TimeStamp();
dispatch_group_wait( tasks, DISPATCH_TIME_FOREVER );
end = TimeStamp();

NSLog( @"wait        = %8.3lf msec", (wait - start) * 1e3 );
NSLog( @"topStart    = %8.3lf msec", (topStart - start) * 1e3 );
NSLog( @"bottomStart = %8.3lf msec", (bottomStart - start) * 1e3 );
NSLog( @" " );
NSLog( @"topTime     = %8.3lf msec", (topEnd - topStart) * 1e3 );
NSLog( @"bottomeTime = %8.3lf msec", (bottomEnd - bottomStart) * 1e3 );
NSLog( @"overallTime = %8.3lf msec", (end - start) * 1e3 );

Here are my results.

Running (r+g+b)/3 on the simulator

2014-04-03 23:16:22.239 GcdTest[1406:c07] single      =   21.546 msec
2014-04-03 23:16:22.239 GcdTest[1406:c07]  
2014-04-03 23:16:25.388 GcdTest[1406:c07] wait        =    0.009 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] topStart    =    0.031 msec
2014-04-03 23:16:25.388 GcdTest[1406:c07] bottomStart =    0.057 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07]  
2014-04-03 23:16:25.389 GcdTest[1406:c07] topTime     =   10.865 msec
2014-04-03 23:16:25.389 GcdTest[1406:c07] bottomeTime =   10.879 msec
2014-04-03 23:16:25.390 GcdTest[1406:c07] overallTime =   10.961 msec

Running (.299r + .587g + .114b) on the simulator

2014-04-03 23:17:27.984 GcdTest[1422:c07] single      =   55.738 msec
2014-04-03 23:17:27.985 GcdTest[1422:c07]  
2014-04-03 23:17:29.306 GcdTest[1422:c07] wait        =    0.008 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] topStart    =    0.054 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07] bottomStart =    0.060 msec
2014-04-03 23:17:29.307 GcdTest[1422:c07]  
2014-04-03 23:17:29.308 GcdTest[1422:c07] topTime     =   28.881 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] bottomeTime =   29.330 msec
2014-04-03 23:17:29.308 GcdTest[1422:c07] overallTime =   29.446 msec

Running (r+g+b)/3 on the iPad2

2014-04-03 23:27:19.601 GcdTest[13032:907] single      =  298.799 msec
2014-04-03 23:27:19.602 GcdTest[13032:907]  
2014-04-03 23:27:20.536 GcdTest[13032:907] wait        =    0.060 msec
2014-04-03 23:27:20.537 GcdTest[13032:907] topStart    =    0.246 msec
2014-04-03 23:27:20.539 GcdTest[13032:907] bottomStart =    2.906 msec
2014-04-03 23:27:20.541 GcdTest[13032:907]  
2014-04-03 23:27:20.542 GcdTest[13032:907] topTime     =  149.596 msec
2014-04-03 23:27:20.544 GcdTest[13032:907] bottomeTime =  149.209 msec
2014-04-03 23:27:20.545 GcdTest[13032:907] overallTime =  152.164 msec

Running (.299r + .587g + .114b) on the iPad2

2014-04-03 23:30:29.618 GcdTest[13045:907] single      =  282.767 msec
2014-04-03 23:30:29.620 GcdTest[13045:907]  
2014-04-03 23:30:34.008 GcdTest[13045:907] wait        =    0.046 msec
2014-04-03 23:30:34.010 GcdTest[13045:907] topStart    =    0.270 msec
2014-04-03 23:30:34.011 GcdTest[13045:907] bottomStart =    3.043 msec
2014-04-03 23:30:34.013 GcdTest[13045:907]  
2014-04-03 23:30:34.014 GcdTest[13045:907] topTime     =  143.078 msec
2014-04-03 23:30:34.015 GcdTest[13045:907] bottomeTime =  143.249 msec
2014-04-03 23:30:34.017 GcdTest[13045:907] overallTime =  146.350 msec

Running ((.299r + .587g + .114b) ^ 2.2) on the iPad2

2014-04-03 23:41:28.959 GcdTest[13078:907] single      = 1258.818 msec
2014-04-03 23:41:28.961 GcdTest[13078:907]  
2014-04-03 23:41:30.768 GcdTest[13078:907] wait        =    0.048 msec
2014-04-03 23:41:30.769 GcdTest[13078:907] topStart    =    0.264 msec
2014-04-03 23:41:30.771 GcdTest[13078:907] bottomStart =    3.037 msec
2014-04-03 23:41:30.772 GcdTest[13078:907]  
2014-04-03 23:41:30.773 GcdTest[13078:907] topTime     =  635.952 msec
2014-04-03 23:41:30.775 GcdTest[13078:907] bottomeTime =  634.749 msec
2014-04-03 23:41:30.776 GcdTest[13078:907] overallTime =  637.829 msec

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...