multithreading - Can someone provide an easy explanation of how 'Full Fences' are implemented in .Net using Threading.MemoryBarrier?


深蓝 · Answer 1 · 2021-10-23T18:34:52+0000

In a sufficiently strong memory model, such as sequential consistency, emitting fence instructions would be unnecessary: all memory accesses would execute in program order and every store would immediately be globally visible.

Memory fences are needed because current common architectures do not provide a memory model that strong: x86/x64, for example, can reorder a read with an earlier write to a different location. (A more thorough source is the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", section 8.2.2, "Memory Ordering in P6 and More Recent Processor Families".) As one example among many, Dekker's algorithm will fail on x86/x64 without fences.
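Dekker's algorithm depends on each thread writing its own flag and then reading the other thread's flag, and that is exactly the store-then-load pair x86/x64 may reorder. Below is a minimal C# sketch of that core step as a store-buffering litmus test, not Dekker's full algorithm; the helper name RunStoreBuffering and the Barrier-based harness are my own assumptions. It counts how often both threads read 0, an outcome sequential consistency forbids.

```csharp
using System;
using System.Threading;

// Store-buffering litmus test (the core step of Dekker's algorithm):
// each thread stores to its own flag, then loads the other thread's flag.
// Under sequential consistency at least one thread must load a 1, so the
// outcome r1 == 0 && r2 == 0 means a load was reordered ahead of the older store.
// RunStoreBuffering is a hypothetical helper written for this illustration.
int RunStoreBuffering(int iterations, bool useFence)
{
    int x = 0, y = 0, r1 = 0, r2 = 0; // plain (non-volatile) locals shared via the lambdas' closure
    int bothZero = 0;
    using var start = new Barrier(3); // main thread + 2 workers
    using var done = new Barrier(3);

    var t1 = new Thread(() =>
    {
        for (int i = 0; i < iterations; i++)
        {
            start.SignalAndWait();
            x = 1;
            if (useFence) Thread.MemoryBarrier(); // the store to x must complete before the load of y
            r1 = y;
            done.SignalAndWait();
        }
    });
    var t2 = new Thread(() =>
    {
        for (int i = 0; i < iterations; i++)
        {
            start.SignalAndWait();
            y = 1;
            if (useFence) Thread.MemoryBarrier(); // the store to y must complete before the load of x
            r2 = x;
            done.SignalAndWait();
        }
    });

    t1.Start();
    t2.Start();
    for (int i = 0; i < iterations; i++)
    {
        x = 0;
        y = 0;                 // reset the flags before releasing both workers
        start.SignalAndWait();
        done.SignalAndWait();  // both workers have written r1/r2 for this round
        if (r1 == 0 && r2 == 0) bothZero++;
    }
    t1.Join();
    t2.Join();
    return bothZero;
}

Console.WriteLine($"without fence: {RunStoreBuffering(20_000, useFence: false)}");
Console.WriteLine($"with fence:    {RunStoreBuffering(20_000, useFence: true)}");
```

Without the fence the count may be nonzero on x86/x64, though how often it shows up depends heavily on hardware and timing; with Thread.MemoryBarrier() between each store and the following load, the count should stay at 0.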

Even if the JIT produces machine code in which instructions with memory loads and stores are carefully placed, its efforts are useless if the CPU then reorders these loads and stores - which it can, as long as the illusion of sequential consistency is maintained for the current context/thread.

Risking oversimplification: it may help to visualize the loads and stores resulting from the instruction stream as a thundering herd of wild animals. As they cross a narrow bridge (your CPU), you can never be sure about the order of the animals, since some of them will be slower, some faster, some overtake, some fall behind. If at the start - when you emit the machine code - you partition them into groups by putting infinitely long fences between them, you can at least be sure that group A comes before group B.

Fences ensure the ordering of reads and writes. Wording is not exact, but:

a store fence "waits" for all outstanding store (write) operations to finish, but does not affect loads.
a load fence "waits" for all outstanding load (read) operations to finish, but does not affect stores.
a full fence "waits" for all store and load operations to finish. As a result, reads and writes that come before the fence are executed before the reads and writes on the "other side of the fence" (those that come later than the fence).
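Applied to the classic publish/consume pattern, the full-fence guarantee means a reader that sees the flag also sees the data written before it. A minimal C# sketch, assuming two threads that share plain (non-volatile) variables; PublishAndConsume is a hypothetical helper name.

```csharp
using System;
using System.Threading;

// Hypothetical helper: one thread publishes 'value', another consumes it.
int PublishAndConsume(int value)
{
    int data = 0;
    bool ready = false;   // plain (non-volatile) on purpose: only the fences order the accesses
    int observed = -1;

    var writer = new Thread(() =>
    {
        data = value;
        Thread.MemoryBarrier();     // the store to data completes before the store to ready
        ready = true;
    });

    var reader = new Thread(() =>
    {
        while (!ready)
        {
            Thread.MemoryBarrier(); // also keeps the JIT from hoisting the read of ready out of the loop
        }
        Thread.MemoryBarrier();     // the load of ready completes before the load of data
        observed = data;
    });

    reader.Start();
    writer.Start();
    reader.Join();
    writer.Join();
    return observed;
}

Console.WriteLine(PublishAndConsume(42));
```

This prints 42. For this particular pattern, half fences would be enough in .NET (Volatile.Write on the flag in the writer, Volatile.Read of the flag in the reader); full fences are used here only to match the full-fence description above.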

What the JIT emits for a full fence depends on the CPU architecture and on the memory-ordering guarantees it provides. Since the JIT knows exactly which architecture it runs on, it can issue the proper instruction(s).

On my x64 machine, with .NET 4.0 RC, it happens to be a lock or instruction:

            int a = 0;
00000000  sub         rsp,28h 
            Thread.MemoryBarrier();
00000004  lock or     dword ptr [rsp],0 
            Console.WriteLine(a);
00000009  mov         ecx,1 
0000000e  call        FFFFFFFFEFB45AB0 
00000013  nop 
00000014  add         rsp,28h 
00000018  ret
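To reproduce a listing like the one above, the fence has to sit in a method of its own. Below is a minimal C# sketch; the name FenceThenRead is my own, and it returns a instead of printing it. Assuming .NET 7 or later, setting the DOTNET_JitDisasm environment variable to a pattern matching the method name (e.g. *FenceThenRead*, since local-function names are mangled by the compiler) prints the JIT's machine code; on newer runtimes the exact instructions may differ from the .NET 4.0 RC output shown.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

Console.WriteLine(FenceThenRead());

// Same shape as the listing: a local, a full fence, then a use of the local.
// NoInlining keeps it a separate method so its JIT output can be inspected on its own.
[MethodImpl(MethodImplOptions.NoInlining)]
static int FenceThenRead()
{
    int a = 0;
    Thread.MemoryBarrier(); // shown as "lock or dword ptr [rsp],0" in the x64 listing
    return a;
}
```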

Intel® 64 and IA-32 Architectures Software Developer’s Manual, section 8.1.2:

"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor."
Memory-ordering instructions address this specific need. MFENCE could have been used as a full barrier in the above case (at least in theory: for one, locked operations might be faster; for another, it might result in different behavior). MFENCE and its friends can be found in section 8.2.5, "Strengthening or Weakening the Memory-Ordering Model".
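.NET Core 3.0 and later expose MFENCE and its friends as hardware intrinsics in System.Runtime.Intrinsics.X86 (Sse.StoreFence for SFENCE, Sse2.LoadFence for LFENCE, Sse2.MemoryFence for MFENCE). The sketch below uses MFENCE as the full barrier when SSE2 is available and falls back to the portable Thread.MemoryBarrier() otherwise. FullFence is a hypothetical helper, and I am assuming the JIT does not move ordinary memory accesses across these intrinsics; in normal code Thread.MemoryBarrier() remains the portable choice.

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using System.Threading;

// Hypothetical helper: emit MFENCE directly when SSE2 is available,
// otherwise let the JIT pick its own full fence for this architecture.
// Returns which path was taken so the caller can see it.
string FullFence()
{
    if (Sse2.IsSupported)
    {
        Sse2.MemoryFence();      // MFENCE: orders all earlier loads and stores
        return "MFENCE";
    }
    Thread.MemoryBarrier();      // e.g. a lock-prefixed instruction on x64, something else on ARM
    return "Thread.MemoryBarrier";
}

// The store-only and load-only variants, when the CPU supports them.
if (Sse.IsSupported) Sse.StoreFence();   // SFENCE
if (Sse2.IsSupported) Sse2.LoadFence();  // LFENCE

Console.WriteLine($"full fence emitted via: {FullFence()}");
```

On an x64 CPU (where SSE2 is always present) this takes the MFENCE path; on ARM64, Sse2.IsSupported is false and the fallback runs.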

There are some more ways to serialize stores and loads, though they are either impractical or slower than the above methods:

In chapter 8.3 you can find full serializing instructions like CPUID. These serialize instruction flow as well: "Nothing can pass a serializing instruction and a serializing instruction cannot pass any other instruction (read, write, instruction fetch, or I/O)".
If you set up memory as strong uncacheable (UC), it will give you a strong memory model: no speculative or out-of-order accesses will be allowed and all accesses will appear on the bus, so there is no need to emit an instruction. :) Of course, this will be a tad slower than usual.

...

So it depends. If there were a computer with strong ordering guarantees, the JIT would probably emit nothing.

IA64 and other architectures have their own memory models - and thus guarantees of memory ordering (or lack of them) - and their own instructions/ways to deal with memory store/load ordering.
