Friday, 26 January 2018

Detecting binary files in the history of a git repository

Git, VCSes and binary files

Git is famous and has become popular even in the enterprise/commercial environments. But Git is also infamous regarding storage of large and/or binary files that change often, in spite of the fact they can be efficiently stored. For large files there have been several attempts to fix the issue, with varying degree of success, the most successful being git-lfs and git-annex.

My personal view is that, contrary to many practices, is a bad idea to store binaries in any VCS. Still, this practice has been and still is in use in may projects, especially in closed source projects. I won't go into the reasons, and how legitimate they are, let's say that we might finally convince people that binaries should be removed from the VCS, git, in particular.

Since the purpose of a VCS is to make sure all versions of the stored objects are never lost, Linus designed git in such a way that knowing the exact hash of the tip/head of your git branch, it is guaranteed the whole history of that branch hasn't changed even if the repository was stored in a non-trusted location (I will ignore hash collisions, for practical reasons).

The consequence of this is that if the history is changed one bit, all commit hashes and history after that change will change also. This is what people refer to when they say they rewrite the (git) history, most often, in the context of a rebase.

But did you know that you could use git rebase to traverse the history of a branch and do all sorts of operations such as detecting all binary files that were ever stored in the branch?

Detecting any binary files, only in the current commit

As with everything on *nix, we start with some building blocks, and construct our solution on top of them. Let's first find all files, except the ones in .git:

find . -type f -print | grep -v '^\.\/\.git\/'
Then we can use the 'file' utility to list for non-text files:
(find . -type f -print | grep -v '^\.\/\.git\/' | xargs file )| egrep -v '(ASCII|Unicode) text'
And if there are any such file, then it means the current git commit is one that needs our attention, otherwise, we're fine.
(find . -type f -print | grep -v '^\.\/\.git\/' | xargs file )| egrep -v '(ASCII|Unicode) text' && (echo 'ERROR:' && git show --oneline -s) || echo OK
 Of course, we assume here, the work tree is clean.

Checking all commits in a branch

Since we want to make this an efficient process and we only care if the history contains binaries, and branches are cheap in git, we can use a temporary branch that can be thrown away after our processing is finalized.
Making a new branch for some experiments is also a good idea to avoid losing the history, in case we do some stupid mistakes during our experiment.

Hence, we first create a new branch which points to the exact same tip the branch to be checked points to, and move to it:
git checkout -b test_bins
Git has many commands that facilitate automation, and my case I want to basically run the chain of commands on all commits. For this we can put our chain of commands in a script:

cat > ../check_file_text.sh
#!/bin/sh

(find . -type f -print | grep -v '^\.\/\.git\/' | xargs file )| egrep -v '(ASCII|Unicode) text' && (echo 'ERROR:' && git show --oneline -s) || echo OK
then (ab)use 'git rebase' to execute that for us for all commits:
git rebase --exec="sh ../check_file_text.sh" -i $startcommit
After we execute this, the editor window will pop up, just save and exit. Assuming $startcommit is the hash of the first commit we know to be clean or beyond which we don't care to search for binaries, this will look in all commits since then.

Here is an example output when checking the newest 5 commits:

$ git rebase --exec="sh ../check_file_text.sh" -i HEAD~5
Executing: sh ../check_file_text.sh
OK
Executing: sh ../check_file_text.sh
OK
Executing: sh ../check_file_text.sh
OK
Executing: sh ../check_file_text.sh
OK
Executing: sh ../check_file_text.sh
OK
Successfully rebased and updated refs/heads/test_bins.

Please note this process can change the history on the test_bins branch, but that is why we used a throw-away branch anyway, right? After we're done, we can go back to another branch and delete the test branch.

$ git co master
Switched to branch 'master'

Your branch is up-to-date with 'origin/master'
$ git branch -D test_bins
Deleted branch test_bins (was 6358b91).
Enjoy!

3 comments:

sourcejedi said...
This comment has been removed by the author.
sourcejedi said...

I could be missing something, but this sounds like git-filter-branch :-P.

eddyp said...

@sourcejedi, I wonly need to detect and check if I would need filter-branch, before that I need to see if I need to do anything. The actual cleaning in some situations would require some project team coordination if the problem is already in the latest code.