language agnostic - Tri-Color Incremental Updating GC: Does it need to scan each stack twice?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

Let me give you a short introduction to a tri-color GC (in case somebody reads it who has never heard of it); if you don't care, skip it and jump to The Problem.

How a Tri-Color GC Works

In a tri-color GC, an object has one of three possible colors: white, gray, or black. A tri-color GC can be described as follows:

1. All objects are initially white.
2. All objects that are reachable because a global variable or a stack variable refers to them ("the root objects") are colored gray.
3. We take any gray object, find all references it has to white objects, and color those white objects gray. Then we color the object itself black.
4. We repeat step 3 as long as there are gray objects.
5. Once there are no gray objects left, all remaining objects are either white or black.
6. All black objects have been proven to be reachable and must stay alive. All white objects are unreachable and can be deleted.
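
The six steps above can be written down as a minimal stop-the-world sketch. The names Obj, Color and mark are made up for this illustration, and the "heap" is just a Python list; a real collector would look quite different.

```python
from enum import Enum

class Color(Enum):
    WHITE = 0
    GRAY = 1
    BLACK = 2

class Obj:
    """Hypothetical heap object: a name, a color, and outgoing references."""
    def __init__(self, name):
        self.name = name
        self.color = Color.WHITE
        self.refs = []

def mark(roots, heap):
    """Run one stop-the-world mark pass and return the objects left white."""
    # Step 1: everything starts white.
    for o in heap:
        o.color = Color.WHITE
    # Step 2: root objects become gray.
    gray = []
    for r in roots:
        if r.color is Color.WHITE:
            r.color = Color.GRAY
            gray.append(r)
    # Steps 3 and 4: process gray objects until none remain.
    while gray:
        o = gray.pop()
        for child in o.refs:
            if child.color is Color.WHITE:
                child.color = Color.GRAY
                gray.append(child)
        o.color = Color.BLACK
    # Steps 5 and 6: black objects survive, white objects are garbage.
    return [o for o in heap if o.color is Color.WHITE]

a, b, c = Obj("A"), Obj("B"), Obj("C")
a.refs.append(b)  # A -> B; nothing refers to C
garbage = mark(roots=[a], heap=[a, b, c])
print([o.name for o in garbage])  # ['C']
```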

So far this is not too complicated… at least if the GC is StW (Stop the World), meaning it will pause all threads while collecting garbage. If it is concurrent, a tri-color GC has an invariant that must hold true at all times:

A black object must not refer to a white object!

This holds true automatically for a StW GC, since every object that is colored black has been examined previously and all white objects it was pointing to were colored gray, thus a black object may only refer to other black objects or gray objects.

If threads are not paused, they can execute code that would break this invariant. There are several ways to prevent this:

1. Capture all read accesses to pointers and check whether the read yields a reference to a white object. If it does, color that object gray immediately. If a ref to this object is later assigned to a black object, it won't matter: the object is gray and no longer white (this implementation uses a read barrier).
2. Capture all write accesses to pointers and check whether the assigned object is white and the object it is assigned to is black. If so, color the white object gray. This is the more obvious way of doing things, but it also needs a bit more processing time (this implementation uses a write barrier).

Since read accesses are much more common than write accesses, the second possibility is the favored one: it involves more processing time each time the barrier is hit, but the barrier is hit far less often. A GC working like that is called an "incremental updating GC".
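
To make the write-barrier variant concrete, here is a small sketch. It assumes every pointer store into an object goes through a hypothetical helper write_ref, and that the collector keeps a hypothetical list gray_worklist of objects it still has to scan; neither name comes from any real runtime.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    """Hypothetical heap object with a single pointer field named ref."""
    def __init__(self, name, color=WHITE):
        self.name = name
        self.color = color
        self.ref = None

gray_worklist = []  # objects waiting to be scanned by the collector

def write_ref(holder, target):
    """Store target into holder.ref through an incremental-update write barrier."""
    if target is not None and holder.color == BLACK and target.color == WHITE:
        # A black object is about to point to a white one: shade the target.
        target.color = GRAY
        gray_worklist.append(target)
    holder.ref = target

black_obj = Obj("scanned", BLACK)
gray_obj = Obj("pending", GRAY)
w1, w2 = Obj("w1"), Obj("w2")

write_ref(black_obj, w1)  # would break the invariant, so w1 is shaded gray
write_ref(gray_obj, w2)   # holder is not black yet, so w2 stays white for now
print(w1.color, w2.color)  # gray white
```

In the second call nothing needs to happen, because the collector will still scan the gray holder and find w2 through it.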

There is an alternative to both techniques, called SatB (Snapshot at the Beginning). This variation works slightly differently. It relies on the fact that it is not really necessary to uphold the invariant at all times: it does not matter if a black object refers to a white one, as long as the GC knows that this white object used to be and still is accessible during the current GC cycle (either because there are still gray objects referring to this white object as well, or because a ref to this white object is put onto an explicit stack that the GC also considers when it runs out of gray objects). SatB collectors are used more often in practice because they have some advantages, but IMHO they are harder to implement.
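
For comparison, a SatB-style barrier could be sketched roughly like this. It is a simplification under stated assumptions: the barrier remembers every overwritten reference to a white object on a hypothetical satb_stack, and the collector drains that stack when it runs out of gray objects. The names write_ref and drain_satb are invented for the example.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    """Hypothetical heap object with a single pointer field named ref."""
    def __init__(self, name, color=WHITE):
        self.name = name
        self.color = color
        self.ref = None

satb_stack = []  # overwritten references remembered for the collector

def write_ref(holder, target):
    """Store target into holder.ref, remembering the reference being overwritten."""
    old = holder.ref
    if old is not None and old.color == WHITE:
        satb_stack.append(old)
    holder.ref = target

def drain_satb(gray_worklist):
    """Called by the collector when it has no gray objects left."""
    while satb_stack:
        o = satb_stack.pop()
        if o.color == WHITE:
            o.color = GRAY
            gray_worklist.append(o)

a, b = Obj("A", GRAY), Obj("B")
a.ref = b           # state at the start of the cycle: A -> B
write_ref(a, None)  # the mutator clears the field; B is remembered
gray = []
drain_satb(gray)
print(b.color)      # gray
```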

I'm referring here to an incremental updating GC that uses variant 2: whenever the code tries to make a black object point to a white object, the barrier immediately colors the white object gray. That way this object won't be missed in the collection cycle.

The Problem

So much about tri-color GCs. But there is one thing I don't understand about tri-color GCs. Let's assume we have an object A, that is referred to by the stack and itself refers to an object B.

stack -> A.ref -> B

Now the GC starts a cycle, halts the thread, scans the stack and sees A as directly accessible, coloring A gray. Once it is done with scanning the whole stack, it unpauses the thread again and starts processing at step (3). Before it starts doing anything, it is preempted (can happen) and the thread runs again and executes the following code:

localRef = A.ref; // localRef points to B
A.ref = NULL;     // Now only the stack points to B
sleep(10000);     // Sleep for the whole GC cycle

The invariant has not been violated: B was white, but it has never been assigned to a black object, so its color has not changed and it is still white. A no longer refers to B, so processing the gray A will not change B's color, and A will become black. At the end of the cycle, B is still white and looks like garbage. However, localRef still refers to B, so it is not garbage.
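
The scenario can be replayed in a small simulation. It assumes that stack slots are plain variables without any barrier and models the stack as a dict; scan_stack, drain and run_cycle are hypothetical names for this sketch only. The rescan_stack flag shows what a second stack scan before the sweep would change.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"

class Obj:
    """Hypothetical heap object with a single pointer field named ref."""
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.ref = None

def scan_stack(stack, gray):
    """Shade every white object that a stack slot refers to."""
    for o in stack.values():
        if o is not None and o.color == WHITE:
            o.color = GRAY
            gray.append(o)

def drain(gray):
    """Process gray objects until none are left."""
    while gray:
        o = gray.pop()
        if o.ref is not None and o.ref.color == WHITE:
            o.ref.color = GRAY
            gray.append(o.ref)
        o.color = BLACK

def run_cycle(rescan_stack):
    a, b = Obj("A"), Obj("B")
    a.ref = b
    stack = {"a": a, "localRef": None}
    gray = []
    scan_stack(stack, gray)    # initial root scan: A becomes gray
    # The mutator runs before the collector has processed A.
    stack["localRef"] = a.ref  # stack write, no barrier involved
    a.ref = None               # heap write, but it stores NULL: nothing to shade
    drain(gray)                # A turns black, B is never seen
    if rescan_stack:
        scan_stack(stack, gray)  # second scan finds B through localRef
        drain(gray)
    return b.color

print(run_cycle(rescan_stack=False))  # white
print(run_cycle(rescan_stack=True))   # black
```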

The Question

Am I right that a tri-color GC must scan the stack of each thread twice? Once at the very beginning, to identify root objects (which are colored gray), and again before deleting the white objects, as those might be referenced by the stack even though no other object refers to them any longer. No description of the algorithm I've seen so far mentioned anything about scanning the stack twice. They all only said that, when used concurrently, it is important that the invariant is enforced at all times, otherwise reachable objects are missed. But as far as I can see, that is not enough. The stack must be considered like a single big object: once scanned, the "stack is black", and every ref update of the stack must cause the referenced object to be colored gray.

If that is really the case, using incremental updating may be trickier than I initially thought and has some performance drawbacks, since stack changes are the most frequent ones of all.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:29:24+0000

A bit of terminology:

Let me give some names so that explanations are clearer.

A variable is any slot for data, which may contain a pointer and may change over time. This includes global variables, local variables, CPU registers and fields in allocated objects.

In a tricolor incremental or concurrent GC, there are three types of variables:

the true roots, which are always accessible (CPU registers, global variables);
the fast variables, which are scanned in a stop-the-world fashion;
the slow variables, which are handled with the colors. Slow variables are fields in colored objects.

The "true roots" and "fast variables" will be hereafter collectively called roots.

The application threads are called the mutators because they change the contents of variables.

With an incremental or concurrent GC, GC pauses occur regularly. The world is stopped (mutators are paused), and the roots are scanned. This scan reveals a number of references to colored objects. Object colors are adjusted accordingly (white objects found this way are made grey).

When the GC is incremental, some object scanning activity takes place: some grey objects are scanned (and painted black), greying referenced white objects. This activity (the "marking") is maintained for some time, but not necessarily as long as there are grey objects. At some point, the marking stops and the world is awakened. The GC is called "incremental" because the GC cycle is performed in small increments, interleaved with mutator activity.

In a concurrent GC, scanning of grey objects occurs concurrently with mutator activity. The world is then awakened as soon as the roots have been scanned. With a concurrent GC, access barriers are a tad complex to implement because they must handle concurrent access from the GC thread; but at a conceptual level, this is not very different from an incremental GC. A concurrent GC can be viewed as an optimization over incremental GC, which takes advantage of the presence of multiple CPU cores (a concurrent GC has little advantage over an incremental GC when there is only one core).

Roots need not be protected by an access barrier, since they are scanned with the world stopped. The GC mark phase ends when the following conditions are simultaneously met:

the roots have just been scanned;
all objects are either black or white, but not grey.

This situation can therefore occur only during a pause. At that point, the sweep phase begins, during which white objects are released. The sweep can be done incrementally or concurrently; objects created during the sweep are immediately painted black. When the sweep is finished, a new GC mark phase can take place: objects (which are all black at that point) are all repainted white (this is done atomically by simply changing the way color bits are interpreted).
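
The termination condition above can be illustrated with a toy incremental mark phase. This is only a sketch under simplifying assumptions: the roots are a plain list, each pause marks at most budget objects, and mutator activity between pauses is a callback. All names (scan_roots, mark_some, mark_phase, mutator_step) are invented for the example.

```python
WHITE, GREY, BLACK = "white", "grey", "black"

class Obj:
    """Hypothetical heap object with a list of outgoing references."""
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.refs = []

def scan_roots(roots, grey):
    """Stop-the-world root scan: grey every white object a root refers to."""
    for o in roots:
        if o.color == WHITE:
            o.color = GREY
            grey.append(o)

def mark_some(grey, budget):
    """Scan at most budget grey objects, painting each one black."""
    while grey and budget > 0:
        o = grey.pop()
        for child in o.refs:
            if child.color == WHITE:
                child.color = GREY
                grey.append(child)
        o.color = BLACK
        budget -= 1

def mark_phase(roots, mutate, budget=1):
    """Return the number of pauses needed to finish marking."""
    grey, pauses = [], 0
    while True:
        pauses += 1
        scan_roots(roots, grey)  # the world is stopped here
        mark_some(grey, budget)  # a bounded increment of marking
        if not grey:
            # Roots were scanned in this very pause and nothing is grey.
            return pauses
        mutate()                 # the world is awakened

a, b, x = Obj("A"), Obj("B"), Obj("X")
a.refs.append(b)
roots = [a, x]

def mutator_step():
    if a.refs:  # runs once: move B from A's field into a root slot
        roots.append(a.refs.pop())

pauses = mark_phase(roots, mutator_step)
print(pauses, b.color)  # 3 black
```

B is found by the root scan of the second pause, even though A no longer refers to it by the time A is scanned.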

Variable Classification:

With that being said, I can now answer your question. With the description above, the question becomes: what are the roots? This is actually up to the implementation; there are several possibilities and trade-offs.

True roots must always be scanned; true roots are the CPU register contents and the global variables. Note that the stacks are not true roots; only the current stack frame pointer is.

Since fast variables are accessed without barriers, it is customary to make stack frames fast variables (i.e. roots as well). This is because while write accesses are rare system-wide, they are quite common in the local variables. It has been measured (on some Lisp programs) that about 99% of writes (of a pointer value) have a local variable as target.

Fast variables are often extended even further, in the case of a generational GC: the "young generation" consists of a special allocation area for new objects, limited in size, and scanned as fast variables. The bright side of fast variables is fast access (hence the name); the downside is that all these fast variables may be scanned only during a pause (the world is stopped). There is a trade-off on the size of the fast variables, which often translates to a limit on the young generation size. A larger young generation promotes average performance (by reducing the number of access barriers) at the cost of longer pauses.

At the other extreme, you may have no fast variables at all, and no roots but the true roots. The stack frames are then handled as objects, each with its own color. Pauses are then minimal (a mere snapshot of a dozen registers) but barriers must be used even for access to local variables. This is expensive, but has some advantages:

Hard guarantees on pause times can be made. This is difficult if stack frames are roots, because each new thread has its own stack, so the total size of the roots may grow arbitrarily as new threads are launched. If only CPU registers and global variables (no more than a few dozen in typical cases, and their number is known at compilation time) are roots, then pauses can be kept very short.
This allows for dynamic allocation of stack frames in the heap. This is needed if you play with co-routines and continuations, as with Scheme's call/cc primitive. In such a case, frames are no longer handled as a pure "stack". Proper handling of continuations in a GC-aware language mostly requires that function frames be allocated dynamically.
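
As a rough sketch of this "no fast variable" extreme, a stack frame can be modelled as one more coloured object whose fields are the local variables, with every store going through the write barrier. The names Obj, write_barrier and grey_list are hypothetical, and the example assumes the collector has already scanned the frame (so the frame is black).

```python
WHITE, GREY, BLACK = "white", "grey", "black"

class Obj:
    """Hypothetical coloured object; fields maps slot names to references."""
    def __init__(self, name, color=WHITE):
        self.name = name
        self.color = color
        self.fields = {}

grey_list = []  # objects the collector still has to scan

def write_barrier(holder, slot, target):
    """Store target into holder.fields[slot], shading it if holder is black."""
    if target is not None and holder.color == BLACK and target.color == WHITE:
        target.color = GREY
        grey_list.append(target)
    holder.fields[slot] = target

# The stack frame is just another object; assume it was already scanned.
frame = Obj("frame", BLACK)
a, b = Obj("A", GREY), Obj("B")
a.fields["ref"] = b
frame.fields["a"] = a

write_barrier(frame, "localRef", a.fields["ref"])  # localRef = A.ref: B is greyed
write_barrier(a, "ref", None)                      # A.ref = NULL
print(b.color)  # grey
```

With frames treated this way, the store into the local variable is what protects B, at the price of a barrier on every local pointer write.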

It is possible to make stack frames non-root while keeping a young generation as root. Guarantees on pause times can still be made (depending on the young generation size, which is fixed) and some trickery can be applied to make sure that stack frames are in the young generation when their function is active. This can ensure barrier-free access to local variables. None of this is really free, but it can be made efficient enough for most purposes.

Another Conceptual View:

Another way to view root-handling is the following: roots are the variables for which the tricolor rule (no black-to-white pointer) is not maintained at all times; these variables are allowed to be mutated without constraint. But they must be brought back in line regularly, by stopping the world and scanning them.

In practice, the mutators are racing with the GC. Mutators create new objects and point to them; each pause creates new grey objects. In a concurrent or incremental GC, if you let the mutators play with roots for too long, then each pause may create a big batch of new grey objects. In the worst case, the GC cannot scan objects fast enough to keep up with the rate of grey object creation. This is an issue because white objects can be released only during the sweep phase, which is reached only if the GC manages to complete its marking at some point. A usual implementation strategy for an incremental GC is to scan grey objects, during each pause, for a total size which is proportional to the total size of the roots. Thus, pause time remains bounded by the total size of the roots, and if the proportionality factor is well balanced then it can be guaranteed that the GC will ultimately terminate its marking phase and enter the sweeping phase.
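
The effect of the proportionality factor can be seen in a toy model that only counts grey objects. All numbers and names here are made up: each pause is assumed to reveal a fixed number of new grey objects, and the marking budget per pause is factor times the root size.

```python
def pauses_until_marked(backlog, new_grey_per_pause, root_size, factor,
                        max_pauses=1000):
    """Count pauses until no grey object remains; None if marking never catches up."""
    grey = backlog
    budget = int(factor * root_size)  # marking work allowed in one pause
    for pause in range(1, max_pauses + 1):
        grey += new_grey_per_pause    # the root scan reveals new grey objects
        grey -= min(grey, budget)     # bounded marking work for this pause
        if grey == 0:
            return pause
    return None

fast_enough = pauses_until_marked(100, 10, root_size=20, factor=1.0)
too_slow = pauses_until_marked(100, 10, root_size=20, factor=0.5)
print(fast_enough)  # 10
print(too_slow)     # None
```

With a factor of 1.0 the collector clears 20 objects per pause against 10 new ones and finishes after 10 pauses; with 0.5 it only breaks even and the backlog never shrinks.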

In a concurrent GC, things are a bit more complex, because the mutators roam freely in the wild. A possible implementation would make a little bit of incremental marking while the world is still stopped.

Bibliography:

Garbage Collection: Algorithms for Automatic Dynamic Memory Management: a must-read book on garbage collection.
