
multithreading - Does an x86_64 CPU use the same cache lines to communicate between 2 processes via shared memory?

As is known, all cache levels L1/L2/L3 on modern x86_64 are virtually indexed, physically tagged. And all cores communicate via the Last Level Cache (L3) using the cache-coherence protocol MOESI/MESIF over QPI/HyperTransport.

For example, a Sandy Bridge family CPU has a 4 to 16-way L3 cache and a 4 KB page size; this allows data to be exchanged between concurrent processes that execute on different cores via shared memory. This is possible because the L3 cache can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time.

Does this mean that every time process 1 requests the same shared memory region, process 2 flushes its cache lines of that page to RAM, and then process 1 loads the same memory region as cache lines of the page in process 1's virtual address space? Is this really slow, or does the processor use some optimizations?

Does a modern x86_64 CPU use the same cache lines, without any flushes, to communicate between 2 processes with different virtual address spaces via shared memory?
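
For concreteness, here is a minimal sketch (in C) of the setup being asked about: two processes mapping the same physical page via POSIX shared memory. The object name "/qa_shm", the 4 KB size and the writer/reader roles are assumptions made for the example, not anything from the question; run the writer first, then the reader, as two separate processes.

    /* Minimal sketch: two processes share one physical page via POSIX shm.
     * Compile with: cc shm_demo.c -o shm_demo  (add -lrt on older glibc). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        const size_t size = 4096;               /* one 4 KB page */
        int fd = shm_open("/qa_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, size) != 0) return 1;

        /* Each process gets its own virtual address for the same physical page. */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        if (argc > 1 && strcmp(argv[1], "writer") == 0)
            strcpy(p, "hello from process 1");  /* the store lands in this core's cache */
        else
            printf("process 2 reads: %s\n", p); /* serviced through cache coherence    */

        munmap(p, size);
        close(fd);
        return 0;
    }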

Sandy Bridge Intel CPU - L3 cache (the arithmetic is recomputed in the sketch after this list):

  • 8 MB - cache size
  • 64 B - cache line size
  • 128 K - lines (128 K = 8 MB / 64 B)
  • 16-way
  • 8 K - number of sets (8 K = 128 K lines / 16 ways)
  • 13 bits [18:6] of the virtual address - the index that selects the current set
  • 512 KB - addresses that differ by a multiple of 512 KB (8 MB / 16 ways) map to the same set
  • the low 19 bits are significant for determining the current set number

  • 4 KB - standard page size

  • only the low 12 bits are the same in the virtual and the physical address
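
A small arithmetic sketch in C that recomputes the numbers in the list above; the example address is an arbitrary assumption and no real hardware is probed:

    /* Recompute the L3 set-selection arithmetic from the list above
     * (8 MB, 16-way, 64 B lines). Pure integer arithmetic. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const uint64_t cache_size = 8u << 20;                 /* 8 MB L3          */
        const uint64_t line_size  = 64;                       /* 64 B cache line  */
        const uint64_t ways       = 16;
        const uint64_t lines      = cache_size / line_size;   /* 128 K            */
        const uint64_t sets       = lines / ways;             /* 8 K              */

        uint64_t addr = 0x7f1234567ac0;                       /* arbitrary example address */
        uint64_t set  = (addr / line_size) % sets;            /* index = bits [18:6]       */

        printf("lines = %llu, sets = %llu, set of example address = %llu\n",
               (unsigned long long)lines, (unsigned long long)sets,
               (unsigned long long)set);
        return 0;
    }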

We have 7 missing bits [18:12] - i.e. we would need to check (2^7 * 16 ways) = 2048 cache lines. This is the same as a 2048-way cache - so this is very slow. Does this mean that the L3 cache is (physically indexed, physically tagged)?

Summary of index bits missing from the virtual address below the page boundary (page size 4 KB = 12 bits); the sketch after this list recomputes them:

  • L3 (8 MB = 64 B x 128 K lines), 16-way, 8 K sets, 13 index bits [18:6] - 7 bits missing
  • L2 (256 KB = 64 B x 4 K lines), 8-way, 512 sets, 9 index bits [14:6] - 3 bits missing
  • L1 (32 KB = 64 B x 512 lines), 8-way, 64 sets, 6 index bits [11:6] - no bits missing
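
The same arithmetic for all three levels, as a sketch; it reproduces the missing-bit counts in the list above and shows how many lines a purely virtually-indexed lookup would have to probe at each level:

    /* Recompute the "missing index bits" per cache level: index bits that lie
     * above the 12-bit page offset and are therefore unknown before the TLB
     * has translated the address. */
    #include <stdio.h>

    static int log2u(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        struct { const char *name; unsigned size, ways; } cache[] = {
            { "L1",  32u << 10,  8 },
            { "L2", 256u << 10,  8 },
            { "L3",   8u << 20, 16 },
        };
        const unsigned line = 64, page_bits = 12;

        for (int i = 0; i < 3; i++) {
            unsigned sets  = cache[i].size / line / cache[i].ways;
            int index_bits = log2u(sets);          /* index = bits [6 .. top]  */
            int top        = 6 + index_bits - 1;
            int missing    = top >= (int)page_bits ? top - (int)page_bits + 1 : 0;
            printf("%s: %u sets, index [%d:6], %d missing bits -> %u lines to probe\n",
                   cache[i].name, sets, top, missing, (1u << missing) * cache[i].ways);
        }
        return 0;
    }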

It should be:

  • L3 / L2 (physically indexed, physically tagged) used after TLB lookup
  • L1 (virtually indexed, physically tagged)




1 Reply


"This is possible because the L3 cache can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time."

Huh what? If both processes have the same page mapped, they can both hit in the cache on the same line of physical memory.

That's part of the benefit of Intel's multicore designs using large inclusive L3 caches. Coherency only requires checking L3 tags to find cache lines in E or M state in another core's L2 or L1 cache.

Getting data between two cores only requires write-back to L3. I forget where this is documented; maybe http://agner.org/optimize/ or What Every Programmer Should Know About Memory. For cores that don't share any level of cache, you need a transfer between different caches at the same level of the cache hierarchy, as part of the coherency protocol. This is possible even if the line is "dirty", with the new owner assuming responsibility for eventually writing back the contents that don't match DRAM.
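
As a rough illustration of that point, the sketch below hands data from one core to another through ordinary cacheable memory using C11 atomics; no explicit flush instruction (clflush or similar) appears anywhere, because the hardware coherence protocol moves the line. Threads are used only to keep the example short; two processes sharing an mmap'ed page behave the same way.

    /* Producer publishes data; consumer on another core picks it up.
     * Compile with: cc -pthread flag_demo.c -o flag_demo */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;                  /* data handed to the other core */
    static atomic_int ready;             /* "payload is published" flag   */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;                    /* write sits dirty in this core's L1/L2 */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                            /* spins until the dirty line is supplied
                                            by the coherence protocol (via L3)    */
        printf("consumer sees payload = %d\n", payload);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }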


The same cache line mapped to different virtual addresses will always go in the same set of the L1 cache. See the discussion in the comments: L2 / L3 caches are physically indexed as well as physically tagged, so aliasing is never a problem. (Only L1 could get a speed benefit from virtual indexing. L1 cache misses aren't detected until after address translation is finished, so the physical address is ready in time to probe the higher-level caches.)
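
A small sketch of why the L1 never aliases: with 64 sets and 64 B lines (32 KiB, 8-way), the index is bits [11:6], entirely inside the 12-bit page offset, which is identical in every virtual mapping of the same physical line. The two virtual addresses below are hypothetical mappings with the same page offset:

    /* L1 index bits [11:6] fall inside the page offset, so any two virtual
     * mappings of the same physical line necessarily pick the same set. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        const uint64_t l1_sets = 64, line = 64;      /* 32 KiB / 64 B / 8 ways */
        uint64_t va1 = 0x7f00aa001ac0;               /* hypothetical mapping 1 */
        uint64_t va2 = 0x7f00bb765ac0;               /* hypothetical mapping 2:
                                                        same page offset 0xac0 */
        uint64_t set1 = (va1 / line) % l1_sets;      /* uses only bits [11:6]  */
        uint64_t set2 = (va2 / line) % l1_sets;
        printf("set1 = %llu, set2 = %llu (equal: the index is inside the page offset)\n",
               (unsigned long long)set1, (unsigned long long)set2);
        return 0;
    }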

Also note that the discussion in the comments incorrectly mentions Skylake lowering the associativity of the L1 cache. In fact, it's the Skylake L2 cache that's less associative than before (4-way, down from 8-way in SnB/Haswell/Broadwell). L1 is still 32 KiB 8-way as always: the maximum size for that associativity that keeps the page-selection address bits out of the index. So there's no mystery after all.

Also see another answer to this question about HT threads on the same core communicating through L1. I said more about cache ways and sets there.

