Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share


0 votes
566 views
in Technique by (71.8m points)

python - why is multiprocess Pool slower than a for loop?

from multiprocessing import Pool
import time

def op1(data):
    return [data[elem] + 1 for elem in range(len(data))]

data = [[elem for elem in range(20)] for elem in range(500000)]

# plain for loop
start_time = time.time()
re = []
for data_ in data:
    re.append(op1(data_))
print('--- %s seconds ---' % (time.time() - start_time))

# multiprocessing pool with 4 workers
start_time = time.time()
pool = Pool(processes=4)
data = pool.map(op1, data)
print('--- %s seconds ---' % (time.time() - start_time))

I get a much slower run time with the pool than with the for loop. But isn't the pool supposed to be using 4 processes to do the computation in parallel?



1 Reply

0 votes
by (71.8m points)

Short answer: Yes, the operations will usually be distributed over (a subset of) the available cores. But the communication overhead is large, and in your example the workload is too small compared to that overhead.

When you construct a pool, a number of worker processes are created. If you then instruct the pool to map over some input, the following happens:

  1. the data will be split: every worker gets an approximately fair share;
  2. the data will be communicated to the workers;
  3. every worker will process their share of work;
  4. the results are communicated back to the main process; and
  5. the main process groups the results together.
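The steps above can be sketched in pure Python. This is only an illustration of the shape of the work, not Pool's actual implementation (the real Pool.map splits the input into contiguous chunks of `chunksize`, not round-robin, and the helper names here are made up):

```python
def split(seq, n_workers):
    """Step 1: give every worker an approximately fair share (round-robin)."""
    chunks = [[] for _ in range(n_workers)]
    for i, item in enumerate(seq):
        chunks[i % n_workers].append(item)
    return chunks

def fake_pool_map(func, seq, n_workers=4):
    chunks = split(seq, n_workers)  # 1. split, done by the main process
    # 2./3. in a real Pool each chunk is pickled, sent to a worker process,
    #       and processed there; here we just process each chunk in turn
    partials = [[func(item) for item in chunk] for chunk in chunks]
    # 4./5. results come back and the main process joins them in input order
    result = []
    for i in range(len(seq)):
        result.append(partials[i % n_workers][i // n_workers])
    return result
```

Note that steps 1, 2, 4 and 5 all run in the main process, which is exactly the serial bottleneck described below.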

Now splitting, communicating and joining the data are all steps carried out by the main process, and they cannot be parallelized. Since the operation itself is fast (O(n) for input size n), the overhead has the same time complexity as the computation.

So complexity-wise, even if you had millions of cores it would not make much difference, because communicating the list is probably already more expensive than computing the results.

That is why you should parallelize computationally expensive tasks, not trivial ones. The amount of processing should be large compared to the amount of communication.
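A sketch of the other direction: if each task does enough work per message, the pool does pay off. The repetition count below is an arbitrary way to make the per-row work artificially expensive, and a larger `chunksize` further reduces the number of messages:

```python
from multiprocessing import Pool

def expensive_op(row):
    # artificially CPU-heavy: repeat the per-row work many times,
    # so processing dwarfs the cost of sending one row across
    total = 0
    for _ in range(1_000):
        total += sum(x + 1 for x in row)
    return total

if __name__ == '__main__':
    data = [list(range(20)) for _ in range(2_000)]
    with Pool(processes=4) as pool:
        # chunksize batches many rows into one message, amortizing overhead
        results = pool.map(expensive_op, data, chunksize=100)
    print(results[0])  # -> 210000
```

With work this heavy per task, the same four workers typically beat the serial loop, because communication is now a small fraction of the total time.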

In your example the work is trivial: you add 1 to every element. Serializing, however, is less trivial: the lists have to be encoded before they can be sent to a worker, and decoded again on the other side.
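You can see the serialization cost directly: multiprocessing uses pickle under the hood, so just encoding and decoding the data (a smaller list than in the question, to keep this quick) already takes real time and memory, before any worker does anything:

```python
import pickle
import time

data = [list(range(20)) for _ in range(100_000)]

start = time.time()
payload = pickle.dumps(data)    # what must cross the process boundary
roundtrip = pickle.loads(payload)
elapsed = time.time() - start

print('serialized %d bytes in %.3f seconds' % (len(payload), elapsed))
```

Encoding plus decoding alone can cost on the order of the work itself here, which is the whole problem with parallelizing "+1 on every element".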

