Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
471 views
in Technique[技术] by (71.8m points)

git - New repo with copied history of only currently tracked files

Our current repo has tens of thousands of commits and a fresh clone transfers nearly a gig of data (there are lots of jar files that have since been deleted in the history). We'd like to cut this size down by making a new repo that keeps the full history for just the files that are currently active in the repo, or possibly just modify the current repo to clear the deleted file history. But I'm not sure how to do this in a practical manor.

I've tried the script in Remove deleted files from git history:

for del in `cat deleted.txt`
do
    git filter-branch --index-filter "git rm --cached --ignore-unmatch $del" --prune-empty -- --all
    # The following seems to be necessary every time
    # because otherwise git won't overwrite refs/original
    git reset --hard
    git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
    git reflog expire --expire=now --all
    git gc --aggressive --prune=now
done;

But given that we have tens of thousands of deleted files in the history and tens of thousands of commits, running the script would take an eternity. I started running this for just ONE deleted file 2 hours ago and the filter-branch command is still running, it's going through each of the 40,000+ commits one at a time, and this is on a new Macbook pro with an SSD drive.

I've also read the page https://help.github.com/articles/remove-sensitive-data but this only works for removing single files.

Has anyone been able to do this? I really want to preserve history of currently tracked files, I'm not sure if the space savings benefit would be worth creating a new repo if we can't keep the history.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Delete everything and restore what you want

Rather than delete this-list-of-files one at a time, do the almost-opposite: delete everything and just restore the files you want to keep.

Like so:

# for unix

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter 
  "git rm  --ignore-unmatch --cached -qr . ; 
  cat $PWD/keep-these.txt | tr '
' '' | xargs -d '' git reset -q $GIT_COMMIT --" 
  --prune-empty --tag-name-filter cat -- --all
# for macOS

$ git checkout master
$ git ls-files > keep-these.txt
$ git filter-branch --force --index-filter 
  "git rm  --ignore-unmatch --cached -qr . ; 
  cat $PWD/keep-these.txt | tr '
' '' | xargs -0 git reset -q $GIT_COMMIT --" 
  --prune-empty --tag-name-filter cat -- --all

It may be faster to execute.

Cleanup steps

Once the whole process has finished, then cleanup:

$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now

# optional extra gc. Slow and may not further-reduce the repo size
$ git gc --aggressive --prune=now

Comparing the repository size before and after, should indicate a sizable reduction, and of course only commits that touch the kept files, plus merge commits - even if empty (because that's how --prune-empty works), will be in the history.

$GIT_COMMIT?

The use of $GIT_COMMIT seems to have caused some confusion, from the git filter-branch documentation (emphasis added):

The argument is always evaluated in the shell context using the eval command (with the notable exception of the commit filter, for technical reasons). Prior to that, the $GIT_COMMIT environment variable will be set to contain the id of the commit being rewritten.

That means git filter-branch will provide the variable at run time, it's not provided by you before hand. This can be demonstrated if there's any doubt using this no-op filter branch command:

$ git filter-branch --index-filter "echo current commit is $GIT_COMMIT"
Rewrite d832800a85be9ef4ee6fda2fe4b3b6715c8bb860 (1/xxxxx)current commit is d832800a85be9ef4ee6fda2fe4b3b6715c8bb860
Rewrite cd86555549ac17aeaa28abecaf450b49ce5ae663 (2/xxxxx)current commit is cd86555549ac17aeaa28abecaf450b49ce5ae663
...

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...