caching - How does store-to-load forwarding happen in the case of unaligned memory access?

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I understand how the load/store queue architecture facilitates store-to-load forwarding and the disambiguation of out-of-order speculative loads: it matches load addresses against store addresses.

This address-matching technique will not work if the earlier store is to an unaligned address and the load depends on it. My question is: if the dependent load is issued out of order, how is it disambiguated against earlier stores? Or, what policies do modern architectures use to handle this condition?

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:24:58+0000

Short

The short answer is that it depends on the architecture, but in theory unaligned operations don't necessarily prevent the architecture from performing store forwarding. As a practical matter, however, unaligned load operations create a much larger number of forwarding possibilities, so forwarding to such loads may not be supported at all, or may be less well supported than the aligned cases.

Long

The long answer is that any particular architecture will have some scenarios it can handle efficiently, and others it cannot.

Old or very simple architectures may not have any store-forwarding capabilities at all. These architectures may not execute out of order at all, or may have some out-of-order capability but may simply wait until all prior stores have committed before executing any load.

The next level of sophistication is an architecture that at least has some kind of CAM to check prior store addresses. This architecture may not have store forwarding, but may allow loads to execute in-order or out-of-order once the load address and all prior store addresses are known (and there is no match). If there is a match with a prior store, the architecture may wait until the store commits before executing the load (which will read the stored value from the L1, if any).
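A minimal sketch of that decision, written as a software model rather than a description of any real pipeline. The function names and the returned strings are hypothetical, and a "match" is simplified here to any byte-range overlap.

```python
def overlaps(a_addr, a_size, b_addr, b_size):
    # True if the two byte ranges share at least one byte.
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size


def load_action(load_addr, load_size, prior_stores):
    # prior_stores: list of (addr, size) for every older, uncommitted store.
    # addr is None while that store's address has not been computed yet.
    if any(addr is None for addr, _ in prior_stores):
        return "wait_for_addresses"   # cannot rule out a match yet
    if any(overlaps(addr, size, load_addr, load_size)
           for addr, size in prior_stores):
        return "wait_for_commit"      # no forwarding: read L1 after the store commits
    return "execute"                  # no match: safe to read the cache now
```

For example, `load_action(0x100, 4, [(0x200, 4)])` returns `"execute"`, while a single prior store with an unknown address is enough to hold the load back.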

Next up, we have architectures like the above that wait until prior store addresses are known and also do store forwarding. The behavior is the same as above, except that when a load address hits a prior store, the store data is forwarded to the load without waiting for it to commit to L1.
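As a rough illustration of forwarding, here is a toy store queue lookup. It assumes, for simplicity, that a "hit" means the same address and the same size; the name `forward` and the tuple layout are made up for this sketch. The one detail worth noticing is the search order: the youngest older store wins.

```python
def forward(load_addr, load_size, older_stores):
    # older_stores: list of (addr, data) ordered oldest first, where data is
    # the bytes the store will write once it commits.
    # Scan youngest first so the load sees the most recent value.
    for addr, data in reversed(older_stores):
        if addr == load_addr and len(data) == load_size:
            return data   # forwarded straight from the store queue
    return None           # no hit: the load reads from the cache instead
```

With two pending stores to 0x100, a 4-byte load from 0x100 receives the data of the second (younger) one.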

A big problem with the above designs is that loads still can't execute until all prior store addresses are known. This inhibits out-of-order execution. So next up, we add speculation: if a load at a particular IP has been observed not to depend on prior stores, we just let it execute (read its value) even if prior store addresses aren't known. At retirement there will be a second check to ensure that the assumption that there was no hit to a prior store was correct, and if not there will be some type of pipeline clear and recovery. Loads that are predicted to hit a prior store wait until the store data (and possibly address) is available, since they'll need store forwarding.¹
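A deliberately simplified sketch of that predict-then-verify loop. The class and method names are hypothetical, and a plain set of "bad" load IPs stands in for whatever table a real predictor uses; real designs are considerably more elaborate.

```python
class DependencePredictor:
    def __init__(self):
        # IPs of loads that were seen to hit a prior store.
        self.conflicting_ips = set()

    def may_execute_early(self, load_ip):
        # Speculate only for loads with no recorded conflict.
        return load_ip not in self.conflicting_ips

    def retire(self, load_ip, executed_early, hit_prior_store):
        # Second check at retirement. Returns True if the pipeline must be
        # cleared because an early load turned out to depend on a store.
        if hit_prior_store:
            self.conflicting_ips.add(load_ip)
        return executed_early and hit_prior_store
```

After one misprediction at a given IP, `may_execute_early` returns False for that IP, so the load waits for the store instead of speculating again.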

That's kind of where we are at today. There are yet more advanced techniques, many of which fall under the banner of memory renaming, but as far as I know they are not widely deployed.

Finally, we get to your original question: how all of this interacts with unaligned loads. Most of the above doesn't change. We only need to be more precise about the definition of a hit, that is, about when a load reads data from a previous store.

You have several scenarios:

(1) A later load is totally contained within a prior store. This means that all the bytes read by the load come from the earlier store.
(2) A later load is partially contained within a prior store. This means that one or more bytes of the load come from an earlier store, but one or more bytes do not.
(3) A later load is not contained at all within any earlier store.
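The three scenarios can be told apart with a byte-level comparison of one load against one store. This is only a software illustration (hardware would not build sets of byte addresses); the function name `classify` is made up, and the return values 1, 2, 3 follow the order of the scenarios above.

```python
def classify(store_addr, store_size, load_addr, load_size):
    store_bytes = set(range(store_addr, store_addr + store_size))
    load_bytes = set(range(load_addr, load_addr + load_size))
    common = load_bytes & store_bytes
    if common == load_bytes:
        return 1   # every byte of the load comes from the store
    if common:
        return 2   # some bytes come from the store, some do not
    return 3       # no byte of the load comes from the store
```

For instance, a 4-byte load at 0x103 after an 8-byte store at 0x100 is unaligned but still fully contained, so it falls under scenario 1, whereas a 4-byte load at 0x102 after a 4-byte store at 0x100 is scenario 2.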

On most platforms, all three scenarios can occur regardless of alignment. However, in the case of aligned values, the second case (partial overlap) can only occur when a larger load follows a smaller store, and if the platform supports only one access size, scenario (2) cannot occur at all.

Theoretically, direct² store-to-load forwarding is possible in scenario (1), but not in scenarios (2) or (3).

To catch many practical cases of (1), you only need to check that the store and load addresses are the same, and that the load is not larger than the store. This still misses cases where a small load is fully contained in a larger store but starts at a different address, whether aligned or not.
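To make the gap concrete, here is the cheap check next to the exact containment test. Both function names are hypothetical; the point is only that the first is a subset of the second.

```python
def simple_hit(store_addr, store_size, load_addr, load_size):
    # Cheap check: same start address, load no larger than the store.
    return store_addr == load_addr and load_size <= store_size


def fully_contained(store_addr, store_size, load_addr, load_size):
    # Exact check: every byte of the load lies inside the store.
    return (store_addr <= load_addr
            and load_addr + load_size <= store_addr + store_size)
```

A 4-byte load at 0x104 following an 8-byte store at 0x100 passes `fully_contained` but fails `simple_hit`, which is exactly the kind of case the cheap check gives up on.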

Where alignment helps is that the checks above are easier: you need to compare fewer bits of the addresses (e.g., a 32-bit load can ignore the bottom two bits of the address), and there are fewer possibilities to compare: a 4-byte load can only be contained in an 8-byte store in two possible ways (at the store address or the store address + 4), while misaligned operations can be fully contained in five different ways (at a load address offset any of 0,1,2,3 or 4 bytes from the store).
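The counts in that example can be reproduced by enumerating the load offsets that keep the load inside the store. This is a small sketch with a made-up function name, and it assumes that "aligned" means the load offset is a multiple of the load size.

```python
def containing_offsets(store_size, load_size, aligned):
    # Offsets (load address minus store address) at which the load is
    # fully contained in the store.
    step = load_size if aligned else 1
    return list(range(0, store_size - load_size + 1, step))


print(containing_offsets(8, 4, aligned=True))    # [0, 4]
print(containing_offsets(8, 4, aligned=False))   # [0, 1, 2, 3, 4]
```

Each extra offset is another comparison the store queue has to be able to recognize, which is why the unaligned cases cost more hardware.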

These differences are important in hardware, where the store queue has to look something like a fully-associative CAM implementing these comparisons. The more general the comparison, the more hardware is needed (or the longer the latency to do a lookup). Early hardware may have only caught the "same address" cases of (1), but the trend is towards catching more cases, both aligned and unaligned. Here is a great overview.

¹ How best to do this type of memory-dependence speculation is something on which WARF holds patents, and on the basis of which it is actively suing all sorts of CPU manufacturers.

² By direct I mean from a single store to a following load. In principle, you might also have more complex forms of store forwarding that can take parts of multiple prior stores and forward them to a single load, but it isn't clear to me if current architectures implement this.
