Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Multiprocessing: use only the physical cores?

I have a function foo which consumes a lot of memory and which I would like to run several instances of in parallel.

Suppose I have a CPU with 4 physical cores, each with two logical cores.

My system has enough memory to accommodate 4 instances of foo in parallel but not 8. Moreover, since 4 of these 8 cores are logical ones anyway, I do not expect that using all 8 cores will provide much gain over using the 4 physical ones only.

So I want to run foo on the 4 physical cores only. In other words, I would like to ensure that doing multiprocessing.Pool(4) (4 being the maximum number of concurrent runs of the function I can accommodate on this machine due to memory limitations) dispatches the jobs to the four physical cores (and not, for example, to a combination of two physical cores and their two logical siblings).

How can I do that in Python?

Edit:

I earlier used a code example from multiprocessing, but I am library agnostic, so to avoid confusion I removed it.



1 Reply


I know the topic is quite old now, but as it still appears as the first result when typing 'multiprocessing logical core' in Google, I feel I have to give an additional answer, because I can see how people in 2018 (or even later) could easily get confused here (some answers are indeed a little confusing).

I can see no better place than here to warn readers about some of the answers above, so sorry for bringing the topic back to life.

--> TO COUNT THE CPUs (LOGICAL/PHYSICAL) USE THE PSUTIL MODULE

For example, on an i7 with 4 physical cores / 8 threads:

import psutil

psutil.cpu_count(logical=False)  # returns 4 (physical cores)
psutil.cpu_count(logical=True)   # returns 8 (logical cores)

As simple as that.

With it, you don't have to worry about the OS, the platform, the hardware itself or whatever. I am convinced it is much better than multiprocessing.cpu_count(), which reports logical CPUs only and can sometimes give weird results, from my own experience at least.
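As a minimal sketch of the counting step, the helper below (count_cpus is a hypothetical name, not part of psutil) wraps the two psutil calls and falls back to os.cpu_count() when psutil is not installed. The fallback is an assumption of this sketch: without psutil it cannot tell physical from logical cores, so it reports the logical count for both.

```python
import os


def count_cpus():
    """Return (physical, logical) CPU counts, using psutil when available."""
    logical = os.cpu_count() or 1  # os.cpu_count() may return None
    try:
        import psutil  # third-party package: pip install psutil
    except ImportError:
        # Assumption of this sketch: no psutil means no way to separate
        # physical from logical cores, so report the logical count twice.
        return logical, logical
    # psutil.cpu_count(logical=False) can return None on some platforms.
    physical = psutil.cpu_count(logical=False)
    logical = psutil.cpu_count(logical=True) or logical
    return (physical or logical), logical


physical_cores, logical_cores = count_cpus()
print(f"physical: {physical_cores}, logical: {logical_cores}")
```

On the 4 core / 8 thread machine from the question, this should print 4 and 8 when psutil is installed.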

--> TO USE N PHYSICAL CORES (up to your choice) USE THE MULTIPROCESSING MODULE DESCRIBED BY YUGI

Just count how many physical cores you have and launch a multiprocessing.Pool with that many workers (4 in this case).

Or you can also try to use the joblib.Parallel() function

joblib in 2018 is not part of the standard distribution of Python, but is just a wrapper around the multiprocessing module that was described by Yugi.
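A minimal sketch of that idea follows. The names pick_worker_count and run_in_pool are hypothetical helpers for illustration. Note that the pool size only caps how many processes run at once; it does not pin them to particular cores, the scheduler still decides the placement.

```python
import multiprocessing


def pick_worker_count(physical_cores, max_by_memory):
    """Use at most one worker per physical core and never exceed the memory cap."""
    return max(1, min(physical_cores, max_by_memory))


def run_in_pool(func, inputs, workers):
    """Map func over inputs with a fixed number of worker processes.

    func must be defined at module level so that it can be pickled.
    """
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(func, inputs)
```

Called from under an `if __name__ == "__main__":` guard, `run_in_pool(foo, inputs, pick_worker_count(4, 4))` would run at most 4 instances of foo at a time on the machine from the question.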

--> MOST OF THE TIME, DON'T USE MORE CORES THAN AVAILABLE (unless you have benchmarked a very specific code and proved it was worth it)

We hear here and there (also from some people answering here) that "the OS will take care of it properly if you use more cores than available". It is absolutely 100% false. If you use more cores than available, you will face huge performance drops, because the OS scheduler will try its best to give every task the same attention, switching regularly from one to another. Depending on the OS, it can end up spending a large share of its working time (close to 100% in the worst case) just switching between processes, which would be disastrous.

Don't just trust me : try it, benchmark it, you will see how clear it is.
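One possible way to run that benchmark is sketched below. The task, the function names and the default sizes are all assumptions chosen for illustration; replace cpu_bound_task with your own workload to get meaningful numbers.

```python
import multiprocessing
import time


def cpu_bound_task(n):
    """Pure-Python busy work: the sum of the squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_pool(workers, n_tasks=16, size=2_000_000):
    """Return the wall-clock seconds needed to run n_tasks on `workers` processes."""
    start = time.perf_counter()
    with multiprocessing.Pool(processes=workers) as pool:
        pool.map(cpu_bound_task, [size] * n_tasks)
    return time.perf_counter() - start


def compare(worker_counts):
    """Time the same workload for each pool size and return {workers: seconds}."""
    return {workers: time_pool(workers) for workers in worker_counts}
```

Calling `compare([2, 4, 8, 16])` from under an `if __name__ == "__main__":` guard gives one timing per pool size; the absolute values depend entirely on your hardware and OS, so only compare them with each other.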

IS IT POSSIBLE TO DECIDE WHETHER THE CODE WILL BE EXECUTED ON A LOGICAL OR A PHYSICAL CORE ?

If you are asking this question, this means you don't understand the way physical and logical cores are designed, so maybe you should check a little bit more about a processor's architecture.

If you want to run on core 3 rather than core 1, for example, there are indeed some solutions: the OS lets you restrict a process to chosen CPUs through CPU affinity (os.sched_setaffinity on Linux, for instance). But unless you know how the kernel and its scheduler work, which I think is not the case if you're asking this question, it rarely helps.

If you launch 4 CPU-intensive processes on a 4 physical / 8 logical processor, the scheduler will assign each of your processes to 1 distinct physical core (and 4 logical cores will remain unused or poorly used). But on a 4 physical / 8 thread processor, if the processing units are (0,1) (2,3) (4,5) (6,7), then it makes no difference whether a process is executed on 0 or 1: it is the same processing unit.

To my knowledge at least (an expert could confirm or refute this, and it may also differ for very specific hardware), I think there is no or very little difference between executing code on 0 or on 1. In the processing unit (0,1), I am not sure that 0 is the logical core whereas 1 is the physical one, or vice versa. From my understanding (which can be wrong), both are processors of the same processing unit, they just share their cache and their access to the hardware (RAM included), and 0 is no more a physical unit than 1.

More than that, you should let the OS decide, because the OS scheduler can take advantage of hardware turbo boost features that exist on some platforms (e.g. i7, i5, i3...), something else that you have no control over and that could be truly helpful to you.
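For readers who still want to experiment with placement, the OS does expose CPU affinity. The sketch below is Linux-only and rests on an assumption you have to check yourself: which logical CPU ids belong to which physical core differs from machine to machine. pin_current_process is a hypothetical helper name.

```python
import os


def pin_current_process(cpus):
    """Restrict the current process to the given logical CPU ids (Linux only).

    Returns the set of CPUs the process may run on afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available on this platform")
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

For instance, `pin_current_process({0, 2, 4, 6})` would keep a process on one logical CPU per pair, assuming the pairs really are (0,1) (2,3) (4,5) (6,7) on your machine. Benchmark before and after; leaving the decision to the scheduler is often just as good.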

If you launch 5 CPU-intensive tasks on a 4 physical / 8 logical core processor, the behaviour will be chaotic, almost unpredictable, and mostly dependent on your hardware and OS. The scheduler will try its best. Almost every time, you will face really bad performance.

Let's presume for a moment that we are still talking about a classical 4(8) architecture. Because the scheduler tries its best (and therefore often switches the assignments), depending on the process you are executing, it could be even worse to launch on 5 logical cores than on 8 logical cores (where at least it knows everything will be used at 100% anyway, so, lost for lost, it won't try much to avoid it, won't switch too often, and therefore won't lose too much time switching).

It is 99% sure, however (but benchmark it on your hardware to be sure), that almost any multiprocessing program will run slower if you use more processes than you have physical cores.

A lot of things can come into play... The program, the hardware, the state of the OS, the scheduler it uses, the fruit you ate this morning, your sister's name... If you are in doubt about something, just benchmark it; there is no other easy way to see whether you are losing performance or not. Sometimes computing can be really weird.

--> MOST OF THE TIME, ADDITIONAL LOGICAL CORES ARE INDEED USELESS IN PYTHON (but not always)

There are 2 main ways of running tasks truly in parallel in Python.

  • multiprocessing (cannot take advantage of logical cores)
  • multithreading (can take advantage of logical cores)

For example to run 4 tasks in parallel

--> multiprocessing will create 4 different Python interpreters. For each of them you have to start a Python interpreter, define the read/write permissions, define the environment, allocate a lot of memory, etc. Let's say it as it is: you will start a whole new program instance from scratch. It can take a huge amount of time, so you have to be sure that this new program will work long enough for it to be worth it.

If your program has enough work (let's say, a few seconds of work at least), then because the OS allocates CPU-consuming processes to different physical cores, it works, and you can gain a lot of performance, which is great. And because the OS almost always allows processes to communicate with each other (although it is slow), they can even exchange (a little bit of) data.

--> multithreading is different. Within your Python interpreter, it just creates a small amount of memory that several CPUs can share and work on at the same time. It is MUCH quicker to spawn (where spawning a new process on an old computer can sometimes take many seconds, spawning a thread takes a ridiculously small fraction of that time). You don't create new processes, but "threads", which are much lighter.

Threads can share data with each other very quickly, because they literally work together on the same memory (while it has to be copied/exchanged when working with different processes).

BUT: WHY CANNOT WE USE MULTITHREADING IN MOST SITUATIONS ? IT LOOKS VERY CONVENIENT ?

There is a very BIG limitation in Python: only one thread can execute Python bytecode at a time in a Python interpreter, because of the GIL (Global Interpreter Lock). So most of the time, you will even LOSE performance by using multithreading, because the threads have to wait for each other to get the lock. For CPU-bound work, multithreading is USELESS and even WORSE if your code is pure Python.
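You can check this on your own machine with a sketch like the one below, assuming a standard CPython build with the GIL. count_down is an arbitrary pure-Python loop chosen for illustration, and the helper names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound pure-Python loop; it holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


def run_sequential(jobs):
    """Run every job one after the other in the calling thread."""
    return [count_down(n) for n in jobs]


def run_threaded(jobs):
    """Run every job in its own thread."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as executor:
        return list(executor.map(count_down, jobs))


def timed(func, jobs):
    """Return (result, elapsed seconds) for func(jobs)."""
    start = time.perf_counter()
    result = func(jobs)
    return result, time.perf_counter() - start
```

Comparing `timed(run_sequential, [5_000_000] * 4)` with `timed(run_threaded, [5_000_000] * 4)` typically shows the threaded version being no faster, and sometimes slower, than the sequential one.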

--> WHY SHOULDN'T I USE LOGICAL CORES WHEN USING MULTIPROCESSING ?

Logical cores don't have their own memory access. They can only work through the memory access and the cache of their hosting physical core. For example, it is very likely (and indeed often the case) that the logical and the physical core of the same processing unit both run the same C/C++ function on different locations of the cache at the same time, which makes the processing hugely faster.

But... these are C/C++ functions! Python is a big C/C++ wrapper that needs much more memory and CPU than its equivalent C++ code. It is very likely in 2018 that, whatever you want to do, 2 big Python processes will need much, much more memory and cache reading/writing than what a single physical+logical unit can afford, and much more than what the equivalent truly multithreaded C/C++ code would consume. This, once again, would almost always cause performance to drop. Remember that every variable that is not available in the processor's cache takes orders of magnitude longer to read from memory. If your cache is already completely full for 1 single Python process, guess what will happen if you force 2 processes to use it: they will use it one at a time and switch constantly, causing data to be needlessly flushed and re-read every time they switch. While data is being read from or written to memory, you might think that your CPU is working, but it's not. It's waiting for the data, doing nothing!

--> HOW CAN YOU TAKE ADVANTAGE OF LOGICAL CORES THEN ?

Like I said, there is no true multithreading (so no true usage of logical cores) in default Python, because of the global interpreter lock. You can force the GIL to be released during some parts of the program, but I think it would be wise advice not to touch it if you don't know exactly what you are doing.

Removing the GIL has definitely been the subject of a lot of research (see the experimental PyPy-STM project, or Cython, which lets you release the GIL in sections that do not touch Python objects).

For now, no real solution exists for it, as it is a much more complex problem than it seems.

There is, I admit, another solution that can work:

  • Code your function in C
  • Wrap it in Python with ctypes
  • Use the Python threading module to call your wrapped C function

This will work 100%, and you will be able to use all the logical cores, in python, with multithreading, and for real. The GIL won't bother you, because you won't be executing true python functions, but C functions instead.

For example, some libraries like NumPy can work on all available threads, because they are coded in C. But if you come to this point, I have always thought it could be wise to consider writing your program in C/C++ directly, because this is very far from the original pythonic spirit.
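The effect can be sketched without writing any C. The example below does not use ctypes or NumPy; as a stand-in it uses the standard library's hashlib, which is implemented in C and is documented to release the GIL while hashing data larger than 2047 bytes. The helper names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(buffer):
    """Hash one buffer; the C code releases the GIL for large inputs."""
    return hashlib.sha256(buffer).hexdigest()


def digest_all_threaded(buffers, workers=4):
    """Hash several buffers in threads, which can overlap while the GIL is released."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(digest, buffers))
```

With buffers of a few hundred megabytes each, `digest_all_threaded` can keep several cores busy at once, which a pure-Python loop running in threads cannot do. A C function wrapped with ctypes behaves the same way, as long as it is loaded through ctypes.CDLL, which releases the GIL during the call.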

--> DON'T ALWAYS USE ALL AVAILABLE PHYSICAL CORES

I often see people be like "OK, I have 8 physical cores, so I will take 8 cores for my job". It often works, but sometimes turns out to be a poor idea, especially if your job needs a lot of I/O.

Try with N-1 cores (once again, especially for highly I/O-demanding tasks), and you will see that almost every time, on a per-task average, single tasks run faster on N-1 cores. Indeed, your computer does a lot of different things: USB, mouse, keyboard, network, hard drive, etc. Even on a workstation, periodic tasks that you have no idea about run in the background all the time. If you don't leave 1 physical core to manage those tasks, your computation will be regularly interrupted (flushed out of the cache and brought back in), which can also lead to performance issues.
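In code, the N-1 rule is a one-liner. The helper below is a hypothetical sketch; the `reserved` parameter is an assumption added so that you can try leaving more than one core free.

```python
def workers_with_headroom(physical_cores, reserved=1):
    """Leave `reserved` physical cores to the OS, but always keep at least one worker."""
    return max(1, physical_cores - reserved)


print(workers_with_headroom(8))  # 7 workers on an 8-core machine
```

As with everything above, time your job with N and with N-1 workers before settling on a value.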

You might think "Well, background tasks will use only 5% of CPU-time so there is 95% left". But it's not the case.

The processor handles one task at a time. And every time it switches, a considerable amount of time is wasted putting everything back in place in the cache and registers. Then, if for

