
python - Why does "multiprocessing.Pool" run endlessly on Windows?

I have defined the function get_content to crawl data from https://www.investopedia.com/. I tried get_content('https://www.investopedia.com/terms/1/0x-protocol.asp') and it worked. However, when I run it through multiprocessing.Pool, the process seems to run forever on my Windows laptop. The same code runs fine on Google Colab and on Linux laptops.

Could you please explain why my function does not work in this parallel setting on Windows?

import requests
from bs4 import BeautifulSoup
from multiprocessing import dummy, freeze_support, Pool
import os
core = os.cpu_count() # Number of logical processors for parallel computing
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
session = requests.Session() 
links = ['https://www.investopedia.com/terms/1/0x-protocol.asp', 'https://www.investopedia.com/terms/1/1-10net30.asp']

############ Get content of a word
def get_content(l):
    r = session.get(l, headers = headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    entry_name = soup.select_one('#article-heading_3-0').contents[0]
    print(entry_name)

############ Parallel computing 
if __name__== "__main__":
    freeze_support()
    P_d = dummy.Pool(processes = core)
    P = Pool(processes = core)   
    #content_list = P_d.map(get_content, links)
    content_list = P.map(get_content, links)

Update 1: I run this code in JupyterLab from the Anaconda distribution. As the screenshot below shows, the kernel status stays busy the whole time.

[Screenshot: JupyterLab kernel status stuck on "Busy"]

Update 2: In Spyder the code finishes executing, but it still returns no output.

[Screenshot: Spyder console after the run, with no output shown]

Update 3: The code runs perfectly fine in Google Colab:

[Screenshot: the same code completing successfully in Google Colab]


1 Reply


There is quite a bit to unpack here, but it basically all boils down to how Python spins up a new process and executes the function you want.

On *nix systems, the default way to create a new process is fork. This is great because it uses "copy-on-write" to give the new child process access to a copy of the parent's working memory. It is fast and efficient, but it comes with a significant drawback if you're using multithreading at the same time: not everything actually gets copied, and some things can end up copied in an invalid state (threads, mutexes, file handles, etc.). This can cause quite a number of problems if not handled correctly, and to get around them Python can use spawn instead (Windows, which has no "fork", must use "spawn").
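If you want to check (or override) which start method your interpreter is using, multiprocessing exposes it directly. A minimal sketch, assuming a plain script run:

import multiprocessing as mp

if __name__ == "__main__":
    # 'fork' on Linux, 'spawn' on Windows (and on macOS since Python 3.8)
    print(mp.get_start_method())
    # To opt into spawn on *nix as well, set it once, before creating any Pool:
    # mp.set_start_method("spawn")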

Spawn basically starts a new interpreter from scratch and does not copy the parent's memory in any way. However, some mechanism must still give the child access to the functions and data defined before it was created, and Python does this by having the new process essentially import the ".py" file it was created from. This is problematic in interactive mode because there isn't really a ".py" file to import, and it is the primary source of "multiprocessing doesn't like interactive" problems. Putting your multiprocessing code into a library which you then import and execute does work interactively, because the library can be imported from a real ".py" file.

This is also why we use the if __name__ == "__main__": line: it separates out any code you don't want re-executed in the child when that import occurs. If you were to spawn a new process without the guard, it could recursively keep spawning children (though there is technically a built-in safeguard for that specific case, if I recall correctly).
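To make the "put it in a library" advice concrete, here is a minimal sketch. The module name scraper.py is hypothetical; the point is only that the worker lives in a file the child can import, and the Pool lives under the guard.

# scraper.py (hypothetical module) -- the worker lives on disk, so the spawned
# child can import it from a real ".py" file instead of from __main__.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}

def get_content(l):
    r = requests.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.select_one('#article-heading_3-0').contents[0]

# run.py (or a notebook cell) -- the child re-imports this module on spawn,
# so the Pool creation must stay inside the guard.
from multiprocessing import Pool, freeze_support
from scraper import get_content

links = ['https://www.investopedia.com/terms/1/0x-protocol.asp',
         'https://www.investopedia.com/terms/1/1-10net30.asp']

if __name__ == "__main__":
    freeze_support()
    with Pool() as P:
        print(P.map(get_content, links))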

With either start method, the parent then communicates with the child over a pipe (using pickle to exchange Python objects), telling it what function to call and what the arguments are. This is why the arguments must be picklable. Some things can't be pickled, which is another common source of errors in multiprocessing.
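For example, a plain module-level function pickles fine (it is sent by reference), while a lambda does not. A quick standard-library-only sketch:

import pickle

pickle.dumps(len)                    # fine: pickled by reference (module + name)
try:
    pickle.dumps(lambda x: x + 1)    # a lambda has no importable name
except Exception as e:
    print(type(e).__name__, e)       # typically pickle.PicklingError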

Finally, on another note: the IPython interpreter (the default Spyder shell) doesn't always collect stdout or stderr from child processes when using "spawn", meaning print statements inside the workers won't be shown. The vanilla (python.exe) interpreter handles this better.
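A common workaround is to return the value from the worker and print it in the parent, so nothing depends on the child's stdout being relayed. A minimal sketch reusing the names from the question, assuming it is run as a plain .py script:

def get_content(l):
    r = session.get(l, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup.select_one('#article-heading_3-0').contents[0]   # return instead of print

if __name__ == "__main__":
    freeze_support()
    with Pool(processes=core) as P:
        for entry_name in P.map(get_content, links):
            print(entry_name)   # printed by the parent, so it shows up in any shell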

In your specific case:

  • JupyterLab is running in interactive mode, so the child process will have been created but will have hit an error something like "can't import get_content from __main__". The error doesn't get displayed properly because it didn't happen in the main process, and Jupyter doesn't relay stderr from the child correctly.
  • Spyder uses IPython by default, which is not relaying the print statements from the child to the parent. Here you can switch to the "external system console" in the "run" dialog, but you then must also do something to keep the window open long enough to read the output, i.e. prevent the process from exiting (see the sketch after this list).
  • Google Colab executes your code on a Google server running Linux rather than locally on your Windows machine, so with "fork" as the start method, the particular issue of not having a ".py" file to import from simply doesn't come up.
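For the external-console route in particular, the tail end of the script could look like this minimal sketch (assuming the rest of the question's code above it, run as a plain .py file):

if __name__ == "__main__":
    freeze_support()                 # harmless otherwise; needed for frozen Windows exes
    with Pool(processes=core) as P:
        content_list = P.map(get_content, links)
    print(content_list)
    input("Press Enter to exit...")  # keeps the external console window open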
