garbage collection - What does git do when we do : git gc - git prune

Question

Welcome To Ask or Share your Answers For Others

garbage collection - What does git do when we do : git gc - git prune

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

garbage collection - What does git do when we do : git gc - git prune

What's going on in background when launching,

git gc
git prune

Output of git gc :

Counting objects: 945490, done. 
Delta compression using up to 4 threads.   
Compressing objects: 100% (334718/334718), done. 
Writing objects: 100%   (945490/945490), done. 
Total 945490 (delta 483105), reused 944529 (delta 482309) 
Checking connectivity: 948048, done.

Output of git prune :

Checking connectivity: 945490, done.

What is the difference between these two options?

Thank you

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:50:01+0000

TL;DR

git prune only removes loose, unreachable, stale objects (objects must have all three properties to get pruned). Unreachable packed objects remain in their pack files. Reachable loose objects remain reachable and loose. Objects that are unreachable, but are not yet stale, also remain untouched. The definition of stale is a little tricky (see details below).

git gc does more: it packs references, packs useful objects, expires reflog entries, prunes loose objects, prunes removed worktrees, and prunes / gc's old git rerere data.

Long

I'm not sure what you mean by "in the background" above (background has a technical meaning in shells and all of the activity here takes place in the shell's foreground but I suspect you did not mean these terms).

What git gc does is to orchestrate a whole series of collection activities, including but not limited to git prune. The list below is the set of commands run by a foreground gc without --auto (omitting their arguments, which depend to some extent on git gc arguments):

git pack-refs: compact references (turn .git/refs/heads/... and .git/refs/tags/... entries into entries in .git/packed-refs, eliminating the individual files)
git reflog expire: expire old reflog entries
git repack: pack loose objects into packed object format
git prune: remove unwanted loose objects
git worktree prune: remove worktree data for added worktrees that the user has deleted
git rerere gc: remove old rerere records

There are a few more individual file activities git gc does on its own, but the above is the main sequence. Note that git prune happens after (1) expiring reflogs and (2) running git repack: this is because an expired reflog entry that is removed may cause an object to become unreferenced, and hence not get packed and then get pruned so that it is completely gone.

Stuff to know before we look at repack and prune

Before going into any more detail, it's a good idea to define what an object is, in Git, and what it means for an object to be loose or packed. We also need to understand what it means for an object to be reachable.

Every object has a hash ID—one of those big ugly IDs you have seen in git log, for instance—that is that object's name, for retrieval purposes. Git stores all the objects in a key-value database where the name is the key, and the object itself is the value. Git's objects are therefore how Git stores files and commits, and in fact, there are four object types: A commit object holds an actual commit. A tree object holds sets of pairs,¹ a human-readable name like README or subdir along with another object's hash ID. That other object is a blob object if the name in the tree is a file name, or it is another tree object if the name is that of a subdirectory. The blob objects hold the actual file contents (but note that the name of the file is in the tree linking to the blob!). The last object type is annotated tag, used for annotated tags, which are not especially interesting here.

Once made, no object can ever be changed. This is because the object's name—it hash ID—is computed by looking at every single bit of the object's content. Change any one bit from a zero to a one or vice versa and the hash ID changes: you now have a different object, with a different name. This is how Git checks that no file has ever been messed-with: if the file contents were changed, the hash ID of the object would change. The object ID is stored in the tree entry, and if the tree object were changed, the tree's ID would change. The tree's ID is stored in the commit, and if the tree ID were changed, the commit's hash would change. So if you know that the commit's hash is a234b67... and the commit's content still hashes to a234b67..., nothing changed in the commit, and the tree ID is still valid. If the tree still hashes to its own name, its content is still valid, so the blob ID is correct; so as long as the blob content hashes to its own name, the blob is correct as well.

Objects can be loose, which means they are stored as files. The name of the file is just the hash ID.² The contents of the loose object are zlib-deflated. Or, objects can be packed, which means many objects are stored in a single pack-file. In this case the contents are not just deflated, they're first delta-compressed. Git picks out a base object—often the latest version of some blob (file)—and then finds additional objects that can be represented as a series of commands: take the base file, remove some text at this offset, add other text at another offset, and so on. The actual format of pack files is documented here, if a bit lightly. Note that unlike most version control systems, the delta-compression occurs at a level below the stored-object abstraction: Git stores whole snapshots, then does delta-compression later, on the underlying objects. Git still accesses an object by its hash-ID name; it's just that reading that object involves reading the pack file, finding the object and its underlying delta bases, and reconstructing the complete object on the fly.

There's a general rule about pack files that states that any delta-compressed object within a pack file must have all its bases in the same pack file. This means that a pack file is self-contained: there's never a need to open multiple additional pack files to get an object out of a pack that has the object. (This particular rule can be deliberately violated, producing what Git calls a thin pack, but those are intended to be used only to send objects over a network connection to another Git that already has the base objects. The other Git must "fix" or "fatten" the thin pack to make a normal pack file, before leaving it behind for the rest of Git.)

Object reachability is a little bit tricky. Let's look first at commit reachability.

Note that when we have a commit object, that commit object itself contains several hash IDs. It has one hash ID for the tree that holds the snapshot that goes with that commit. It also has one or more hash IDs for parent commits, unless this particular commit is a root commit. A root commit is defined as a commit with no parents, so this is a bit circular: a commit has parents, unless it has no parents. It's clear enough though: given some commit, we can draw that commit as a node in a graph, with arrows coming out of the node, one per parent:

<--o
   |
   v

These parent arrows point to the commit's parent or parents. Given a series of single-parent commits we get a simple linear chain:

... <--o  <--o  <--o ...

One of these commits must be the start of the chain: that's the root commit. One of these must be the end, and that's the tip commit. All of the internal arrows point backwards (leftwards) so we can draw this without the arrow-heads, knowing that the root is at the left and the tip is at the right:

o--o--o--o--o

Now we can add a branch name like master. The name simply points to the tip commit:

o--o--o--o--o   <--master

None of the arrows embedded within a commit can ever change, because nothing in any object can ever change. The arrow in the branch name master, however, is actually just the hash ID of some commit, and this can change. Let's use letters to represent the commit hashes:

A--B--C--D--E   <-- master

the name master now just stores the commit hash of commit E. If we add a new commit to master, we do this by writing out a commit whose parent is E and whose tree is our snapshot, giving us an all-new hash, which we can call F. Commit F points back to E. We have Git write F's hash ID into master and now we have:

A--B--C--D--E--F   <-- master

We added one commit and changed one name, master. All the previous commits are reachable by starting at the name master. We read out the hash ID of F and read commit F. This has the hash ID of E, so we have reached commit E. We read E to get the hash ID of D, and thus reach D. We repeat until we read A, find that it has no parent, and are done.

If there are branches, that just means that we have commits found by another name whose parents are one of the commits also found by the name master:

A--B--C--D--E--F   <-- master
             
              G--H   <-- develop

The name develop locates commit H; H finds G; and G refers back to E. So all of these commits are reachable.

Commits with more than one parent—i.e., merge commits—make all their parents reachable if the commit itself is reachable. So once you make a merge commit, you can (but do not have to) delete the branch name that identifies the commit that was merged-in: it's now reachable from the tip of the branch that you were on when you did the merge operation. That is:

...--o--o---o   <-- name
          /
       o--o   <-- delete-able

the commits on the bottom row here are reachable from name, through the merge, just as the commits on the top row were always reachable from name. Deleting the name delete-able leaves them still reachable. If the merge commit is not there, as in this case:

...--o--o   <-- name2
      
       o--o   <-- not-delete-able

then deleting not-delete-able effectively abandons the two commits along the bottom row: they become unreachable, and hence eligible for garbage-collection.

This same reachability property applies to tree and blob objects. Commit G has a tree in it, for instance, and this tree has <name, ID> pairs:

A--B--C--D--E--F   <-- master
             
              G--H   <-- develop
              |
         tree=d097...
            /   
 README=9fa3... Makefile=0b41...

So from commit G, tree object d097... is reachable; from that tree, blob object 9fa3... is reachable, and so is blob object 0b41.... Commit H might have the very same README object, under the same name (though a different tree): that's fine, that just makes

Categories

garbage collection - What does git do when we do : git gc - git prune

garbage collection - What does git do when we do : git gc - git prune

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

TL;DR

Long

Stuff to know before we look at repack and prune

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags