Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
573 views
in Technique[技术] by (71.8m points)

performance - In what circumstances can large pages produce a speedup?

Modern x86 CPUs have the ability to support larger page sizes than the legacy 4K (ie 2MB or 4MB), and there are OS facilities (Linux, Windows) to access this functionality.

The Microsoft link above states large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory". Which isn't very helpful in predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in some performance improvement. Anyone got any success stories ?

There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably as the number of TLB records needing copying is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The biggest difference in performance will come when you are doing widely spaced random accesses to a large region of memory -- where "large" means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4kB pages is often larger than the number of entries for 2MB pages, but this varies a lot by processor. There is also a lot of variation in how many "large page" entries are available in the Level 2 TLB.

For example, on an AMD Opteron Family 10h Revision D ("Istanbul") system, cpuid reports:

  • L1 DTLB: 4kB pages: 48 entries; 2MB pages: 48 entries; 1GB pages: 48 entries
  • L2 TLB: 4kB pages: 512 entries; 2MB pages: 128 entries; 1GB pages: 16 entries

While on an Intel Xeon 56xx ("Westmere") system, cpuid reports:

  • L1 DTLB: 4kB pages: 64 entries; 2MB pages: 32 entries
  • L2 TLB: 4kB pages: 512 entries; 2MB pages: none

Both can map 2MB (512*4kB) using small pages before suffering level 2 TLB misses, while the Westmere system can map 64MB using its 32 2MB TLB entries and the AMD system can map 352MB using the 176 2MB TLB entries in its L1 and L2 TLBs. Either system will get a significant speedup by using large pages for random accesses over memory ranges that are much larger than 2MB and less than 64MB. The AMD system should continue to show good performance using large pages for much larger memory ranges.

What you are trying to avoid in all these cases is the worst case (note 1) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (note 2) work, it requires:

  • 5 trips to memory to load data mapped on a 4kB page,
  • 4 trips to memory to load data mapped on a 2MB page, and
  • 3 trips to memory to load data mapped on a 1GB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD's "AMD64 Architecture Programmer’s Manual Volume 2: System Programming" (publication 24593) http://support.amd.com/us/Embedded_TechDocs/24593.pdf

Note 1: The figures above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 2: Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...