Reduce size of git repository on Bitbucket
After a few months of committing and pushing to my project, the repository on Bitbucket has gradually grown to about 1 GB. I tried to remove some database folders that did not need to be tracked.
After searching, I found that most suggestions propose:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' HEAD
After removing a few folders, I pushed the change to the repository with --force:
git push origin master --force
I then found that the repository gets larger every time I use those commands.
It has now grown to 2.5 GB.
Any suggestions?
Following the suggestion below, I tried these commands
(for all large files)
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" --tag-name-filter cat -- --all
(remove the temporary history git-filter-branch otherwise leaves behind for a long time)
rm -rf .git/refs/original/
git reflog expire --all
git gc --aggressive --prune
But the .git/objects folder is still large.
One solution collected from the web for “Reduce size of git repository on Bitbucket”
OK, given your reply in the comments, we can now say what happened.
What git filter-branch does is copy (some or all of) your commits to new ones, then update the references. This means your repository gets bigger (not smaller), at least initially.
The commits that are copied are those reachable via the references given. In this case, the reference you gave is HEAD (which git turns into “your current branch”, probably master, but whatever your current branch was at the time of the filter-branch command). If (and only if) the new copy is precisely, bit-for-bit identical to the original, then it actually is the original and there is no actual copy made (the original is reused instead). However, as soon as you make any change, such as removing folder/subfolder, from that point on these really are copies.
The copied stuff is, in this case, smaller, because you’ve removed some items. (It’s generally not very much smaller since git compresses items pretty well.) But you’re still adding more stuff to the repository: new commits, which refer to new trees, which, fortunately, refer to the same old blobs (file objects) as before, just slightly fewer of them this time (the objects for the folder/subfolder files are still in the repository, but the copied commits and tree-objects no longer refer to them).
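The growth described above can be observed in a throwaway repository. The following is only a sketch under stated assumptions: the file names big.db and keep.txt are hypothetical, the repository is created in a temporary directory, and FILTER_BRANCH_SQUELCH_WARNING is set only to silence the deprecation notice that newer git versions print.

```shell
set -e
# Assumption: newer git prints a deprecation notice for filter-branch; this silences it.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with one unwanted folder (all names are hypothetical).
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"
mkdir -p folder/subfolder
echo "pretend this is a large database dump" > folder/subfolder/big.db
echo "source code" > keep.txt
git add . && git commit -q -m "initial"

# Count loose objects before and after rewriting only the current branch.
before=$(git count-objects -v | awk '/^count:/ {print $2}')
git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch folder/subfolder' HEAD
after=$(git count-objects -v | awk '/^count:/ {print $2}')

echo "loose objects before: $before, after: $after"
```

The count goes up because the rewritten commit and its new tree are added while the old commit, trees, and blobs all remain in the object store.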
Pictorially, at this point in the filter-branch process, we now have both the old commits:
R--o--o---o--o   <-- master
    \    /
     o--o   <-- feature
and the new ones (I’ll assume folder/subfolder appeared in the original root commit R so that we have a copy R' of it):
R'-o'-o'--o'-o'
    \    /
     o'-o'
What filter-branch does now, at the end of the copying process, is re-point some references (branch and tag names, mainly). The ones it re-points are the ones you tell it to, by mentioning them as what the documentation calls “positive references”. In this case, if you were on master (so that HEAD was another name for master), the single positive reference you gave is master … so that’s all filter-branch re-points. It also makes backup references whose names start with refs/original/. This means you now have the following commits:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o   <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
Note that feature still points to all the old (not-copied) commits, so that even after you get rid of any refs/original/ references, git will retain all the still-referenced commits across any garbage-collect activity, giving:
R--o
    \
     o--o   <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
To get filter-branch to update all the references, you need to name them all. An easy way to do that is to use --all, which quite literally names all references. In this case, the initial “after” picture looks like this instead:
R--o--o---o--o   <-- refs/original/refs/heads/master
    \    /
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'  <-- feature
Now if you erase all the refs/original/ references, all the old commits become unreferenced and can get garbage-collected. Well, that is, they do unless there are tags pointing to them.
For tag references, filter-branch only updates them in any way if you supply a --tag-name-filter. Usually you want --tag-name-filter cat, which keeps the tag names unchanged, but makes filter-branch point them to the newly copied commits. That way you don’t hang on to the old commits: the whole point of the exercise is to make everything use the new copies, and throw away the old copies, so that the big-file objects can be garbage-collected.
Putting this all together, instead of:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder'
you can use:
git filter-branch -f --tree-filter 'rm -rf folder/subfolder' \
  --tag-name-filter cat -- --all
(You don’t need the backslash-newline sequence; I put that in just to make the line fit better on stackoverflow. Note that --tree-filter is very slow: for this particular case it is much faster to use --index-filter. The index filter command here would be git rm --cached --ignore-unmatch -r folder/subfolder.)
Note also that you need to do all this on (a copy of) the original repository (you did keep a backup, right?). (If you did not keep a backup, the refs/original/ references may be your salvation.)
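As a sketch of the faster --index-filter form combined with --tag-name-filter cat and --all, the following builds a small throwaway repository with a second branch and a tag, rewrites every reference, and then checks that none of them still lists the unwanted folder. The repository layout, the names feature, v1, big.db and keep.txt, and the FILTER_BRANCH_SQUELCH_WARNING setting are assumptions made for illustration only.

```shell
set -e
# Assumption: silence the deprecation notice newer git prints for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with one extra branch and one tag (hypothetical names).
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"
mkdir -p folder/subfolder
echo "large data" > folder/subfolder/big.db
echo "code" > keep.txt
git add . && git commit -q -m "initial"
git branch feature
git tag v1

# Rewrite every branch and tag, operating on the index instead of a checkout.
git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
  --tag-name-filter cat -- --all

# Count paths under folder/subfolder still listed by any rewritten reference.
remaining=$(for ref in HEAD feature v1; do
  git ls-tree -r --name-only "$ref"
done | grep -c '^folder/subfolder/' || true)
echo "paths still under folder/subfolder: $remaining"

# The pre-rewrite commits stay reachable through the backup references.
git for-each-ref --format='%(refname)' refs/original/
```

On a real repository the same filter-branch invocation would be run inside a fresh clone or copy, so that the untouched original remains available as a backup.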
Edit: OK, so you did some filter-branch-ing, and you did something that deleted any refs/original/ references. (In my experiment on a temp repo, running git filter-branch on HEAD used whatever branch I was on as the branch that was re-pointed, and made a refs/original/ backup of the previous value.) There are no backups of the repository. Now what?
Well, as a first step, make a backup now. This way, if things get any worse, you can at least get back to “only slightly bad”. To make a backup of the repo, you can simply clone it (or: clone it, then call the original the “backup”, then begin working on the clone). For future reference, since git filter-branch can be quite destructive, it’s usually wise to start by doing this backing-up process. (Also, I’ll note that a clone on Bitbucket, when not yet pushed-to, would serve. Unfortunately you did a push. Perhaps Bitbucket can retrieve an earlier version of the repository from some backups or snapshots of their own.)
Next, let’s note a peculiarity of commits and their SHA-1 “true names”, that I mentioned earlier. The SHA-1 name of a commit is a cryptographic checksum of its contents. Let’s take a look at a sample commit in git’s own source tree (trimmed down a bit just for length, and email addresses whacked to foil harvesters):
$ git cat-file -p 5de7f500c13c8158696a68d86da1030313ddaf69
tree 73eee5d136d2b00c623c3fceceffab85c9e9b47e
parent c4ad00f8ccb59a0ae0735e8e32b203d4bd835616
author Jeff King <peff peff.net> 1405233728 -0400
committer Junio C Hamano <gitster pobox.com> 1406567673 -0700

alloc: factor out commit index

We keep a static counter to set the commit index on newly allocated objects. However, since we also need to set the
[snip]
Here, we can see that the contents of this commit (whose “true name” is 5de7f50...) start with a tree and another SHA-1, a parent and another SHA-1, an author and a committer, then a blank line followed by the commit message text.
If you look at a tree you’ll see that it contains the “true names” (SHA-1 values) of sub-trees (sub-directories) and file objects (“blobs”, in git terminology) along with their modes (really, just whether the blob should have execute permission set, or not) and their names within the directory. For instance, the first line of the above commit’s tree is:
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
which means that the repository object 5e98806... should be extracted, put in a file named .gitattributes, and set non-executable.
If I ask git to make a new commit, and set up, as its contents:
- the same tree (73eee5d...)
- the same parent (c4ad00f...)
- the same author and committer
- and the same blank line and message
then when I get git to write that commit to the repository, it will generate the same “true name”, 5de7f50.... In other words, it literally is the same commit: it’s already in the repository and git commit-tree will just give me back the existing ID. While it’s a bit tricky to set all this up, that’s exactly what git filter-branch ends up doing: it extracts the original commit, applies your filters, sets up everything, and then does a git commit-tree.
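The “same contents, same name” behaviour can be checked directly with git commit-tree. This is a minimal sketch, assuming a throwaway repository, an empty tree, and made-up identity values; the timestamps are pinned through environment variables so that two commits built from the same inputs are byte-for-byte identical.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# Pin author, committer, and both timestamps (hypothetical values) so that
# nothing in the commit contents varies between runs.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.invalid"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.invalid"
export GIT_AUTHOR_DATE="1405233728 -0400"
export GIT_COMMITTER_DATE="1406567673 -0700"

# An empty tree is enough for the demonstration.
tree=$(git mktree </dev/null)

# Same tree, same (absent) parent, same people, same message, written twice.
first=$(git commit-tree "$tree" -m "same message")
second=$(git commit-tree "$tree" -m "same message")

# Changing any part of the contents, here the message, changes the name.
changed=$(git commit-tree "$tree" -m "different message")

echo "first:   $first"
echo "second:  $second"
echo "changed: $changed"
```

The first two IDs are equal because the second call produces an object that is already in the repository, while the third differs because its contents differ.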
What this means for you
On your original repo, you ran a git filter-branch command that copied commits to new, modified commits (with different trees and hence, at some point, different true names which led to different parent IDs in subsequent commits, and so on). However, if you copy those copied commits by applying a filter that this time does nothing, then the new tree objects will be the same as the old ones. If the new parent is the same, and the author, committer, and message also all remain the same, the new commit-ID for the copy will be the same as the old ID.
That is, these new copies are not copies after all, they’re just the originals again!
Any other commits—those that were not copied in the first pass—do get copied, and hence have different IDs.
Here’s where things get tricky.
If your current repository looks like this (graphically speaking):
R--o--o---o--o   <-- xxx  [needs a name so that filter-branch will process it]
    \    /
     o--o   <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
and we apply a new filter-branch to all references (or even “all but master”) in such a way that it generates the same trees this time, it will copy R again and the new tree will match that for R', so the copy will actually be R'. Then it will copy the first node after R, make the same changes, and the copy will actually be the first o' node after R'. This will repeat for all nodes, possibly even including R' and all the o' nodes. If it does copy R', the resulting copy will just be R' again, though, because “remove nonexistent directory” makes no change: our filter does nothing to these particular commits.
Finally, filter-branch will move the labels, leaving the refs/original/ versions behind:
R--o--o---o--o   <-- refs/original/refs/heads/xxx
    \    /
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master, xxx
    \    /
     o'-o'  <-- feature
This is, in fact, the desired outcome.
What if the repository looks more like this? That is, what if there is no xxx or similar label pointing to the original (pre-filtering) master, so that you have this:
R--o
    \
     o--o   <-- feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'
The filter-branch script will still copy R and the result will still be R'. Then it will copy the first o node and the result will still be the first o' node, and so on. It won’t copy the now-deleted nodes, but it won’t have to: we already have those, reachable via the branch-name master. As before, filter-branch may copy R' and the various o' nodes, but this is OK, as the filter will do nothing so that the copies are really just the originals after all.
Then filter-branch will, as usual, update the references:
R--o
    \
     o--o   <-- refs/original/refs/heads/feature

R'-o'-o'--o'-o'  <-- master
    \    /
     o'-o'  <-- feature
The key that makes this all work is that the filter leaves already-modified commits untouched, so that their second “copies” are just the first-copies again. [1]
Once everything is done, you can do the same shrinking described in the git filter-branch documentation to ditch the refs/original/ names and garbage-collect the now-unreferenced objects.
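The shrinking step can be sketched as follows, again in a throwaway repository so the effect is easy to verify. The file names and the FILTER_BRANCH_SQUELCH_WARNING setting are assumptions; the three cleanup commands follow the general shape of the checklist in the git filter-branch documentation, and on a real repository they are destructive, so they belong on a copy.

```shell
set -e
# Assumption: silence the deprecation notice newer git prints for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository containing one unwanted file (hypothetical names).
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.invalid"
mkdir -p folder/subfolder
echo "large data" > folder/subfolder/big.db
echo "code" > keep.txt
git add . && git commit -q -m "initial"

# Remember the blob we want to disappear, then rewrite all references.
blob=$(git rev-parse HEAD:folder/subfolder/big.db)
git filter-branch -f --index-filter \
  'git rm -r -q --cached --ignore-unmatch folder/subfolder' \
  --tag-name-filter cat -- --all

# 1. Delete the backup references that filter-branch left behind.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
# 2. Expire every reflog entry so old commits are no longer kept alive.
git reflog expire --expire=now --all
# 3. Repack and prune all unreachable objects immediately.
git gc --quiet --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
  echo "blob still present"
else
  echo "blob removed"
fi
```

If any branch, tag, or other reference still points at a pre-rewrite commit, step 3 keeps the old objects, which is why the rewrite above names all references.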
[1] If you had been using a filter that is not as easily repeated (e.g., one that makes new commits with “the current time” as their time-stamps), you would really need an untouched original repository, or those refs/original/ references (either one would suffice to keep an “original copy” around).