After deleting a binary file from Git history why is my repository still large?

So let me preface this question by saying that I am aware of the previous questions pertaining to this subject on Stack Overflow. In fact, I’ve tried all the solutions I could find, but there is a binary file in my repo that just refuses to be removed and continues to greatly inflate my repo size.

Methods I’ve tried:

    • David Underhill’s script
    • Github’s Howto

    Both of these were recommended by Darhuuk’s answer to Remove files from git repo completely.

    However, after trying both of those solutions, the script to find large files in git still finds the offending binary, while the script from this answer no longer finds the commit for the binary. Both of these scripts were suggested by this answer.

    The repo is still 44 MB after the attempts at removal, which is far too large for the relatively small size of the source. This suggests the large file script is doing its job properly. I’ve tried pushing up to Github (I made a fork just in case) and then doing a fresh clone to see if the repo size had decreased, but it is still the same size.
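For anyone wanting to put a number on "still the same size", here is a minimal sketch that reads the figures Git itself reports. `repo_size_kib` is a hypothetical helper name introduced for illustration; the only Git command it relies on is `git count-objects -v`.

```shell
#!/usr/bin/env bash
# repo_size_kib is a hypothetical helper name, not a Git command.
# It sums the "size" (loose objects) and "size-pack" (packfiles) fields
# that `git count-objects -v` reports in KiB.
repo_size_kib() {
    git -C "${1:-.}" count-objects -v |
        awk '/^size(-pack)?:/ { total += $2 } END { print total + 0 }'
}

# Usage: repo_size_kib /path/to/repo
```

Running it before and after a cleanup attempt gives a quick check of whether anything was actually freed.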

    Can someone explain what I am doing wrong or suggest an alternative method?

    I should note that I am not just interested in trimming the file from my local repo; I also want to be able to fix the remote repo on Github.

  • 4 Solutions collected from the web for “After deleting a binary file from Git history why is my repository still large?”

    2017 Edit: You should probably look into BFG Repo-Cleaner if you are reading this.


    So, embarrassingly, the reason my local repos were not shrinking in size is that I was using the wrong path to the file in filter-branch. So while I thank J-16 SDiZ and CodeGnome for their answers, my problem was between the chair and the keyboard.

    In an effort to make this question less of a monument to my stupidity and actually useful to people, I’ve taken the time to write up the steps one would have to go through after trimming the repo in order to get the repo back up on Github. Hope this helps someone out down the line.


    Removing offending files

    To remove the offending files, run the shell script below, which is based on the Github remove sensitive data howto. Pass the path of the file to remove, relative to the repository root, as the first argument.

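    #!/usr/bin/env bash
    # Usage: run from the repository root; $1 is the path to remove, relative to that root.
    # Drop the path from the index of every commit on every ref; --ignore-unmatch lets commits without it pass.
    git filter-branch --index-filter "git rm -r -q --cached --ignore-unmatch '$1'" --prune-empty --tag-name-filter cat -- --all
    
    # Delete the backup refs filter-branch leaves behind, then expire reflogs and prune so the old objects can go.
    rm -rf .git/refs/original/
    git reflog expire --expire=now --all
    git gc --prune=now
    git gc --aggressive --prune=now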
    

    I went through every branch on my local repository and did this, but you don’t need to: the script is given --all, so one run covers every local branch. You do, however, need every branch to exist locally for the next step, so keep that in mind. Once you are done you should see the size of your local repo decrease. You should also be able to run the blob script in CodeGnome’s answer and see that the offending blob has been removed. If not, double check that the file name and path are correct.

    What git filter-branch is actually doing here is running the command given to --index-filter against the index of each commit in the repo.

    The rest of the script just removes the leftover references to the old data (the refs/original backup and the reflogs) and garbage-collects the objects they were keeping alive.
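Since a wrong path was the whole problem here, it may be worth checking the path against history before and after rewriting. This is a minimal sketch; `commits_touching` is a hypothetical helper name, and the path in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# commits_touching is a hypothetical helper name.
# It lists every commit reachable from any ref that touches the given path.
# An empty result means the path matches nothing in history, in which case
# filter-branch with that path would leave the repository unchanged.
commits_touching() {
    git log --all --format='%h %s' -- "$1"
}

# Usage (from the repository root): commits_touching path/to/large.bin
```

Before the rewrite the list should be non-empty for the correct path; after a successful rewrite it should be empty.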

    Pushing the trimmed repo

    Now that the local repo is in the state you need it to be, the trick is to get it back up on Github. Unfortunately, as far as I can tell, there is no way to completely remove the binary data from the existing Github repo. Here is the quote from the Github sensitive data howto:

    Be warned that force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them. If you are worried about users accessing the bad commits directly via SHA1, you will have to delete the repo and recreate it.

    It sucks that you need to recreate the Github repo, but the good news is that recreating the repo is actually pretty easy. The pain is that you also have to recreate the data in issues and the wiki, which I’ll go into below.

    What I recommend is creating a new repo on Github and then switching it out with your old repo when you are ready. This can be done by renaming the old repo to something like “repo name old” and then changing the name of the newly created repo to “repo name”. When you create the new repo, make sure to uncheck “Initialize with README”, otherwise you’re not going to be dealing with a clean slate.

    If you completed the last step you should have your repo cleaned and ready to go. The remotes now need to be changed to match the new Github repo location. I do this by editing the .git/config file directly, though I am sure someone is going to tell me that is not the right way to do it.

    Before doing the push, make sure you have all the branches and tags you want to push up in your local repo. Once you are ready, push all branches and tags using the following:
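As an alternative to hand-editing .git/config, the same change can be made with `git remote set-url`. The sketch below wraps it in a hypothetical helper, `retarget_remote`, which also prints the stored URL so the result can be checked; the URL in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# retarget_remote is a hypothetical helper name.
# It repoints an existing remote at a new URL, then prints the URL that is
# now stored in the repository configuration.
retarget_remote() {
    local remote=$1 url=$2
    git remote set-url "$remote" "$url" &&
        git config --get "remote.$remote.url"
}

# Usage (inside the repository): retarget_remote origin git@github.com:USER/NEW-REPO.git
```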

    git push --all
    git push --tags
    

    Now you should have a remote repo that matches your trimmed local repo. Double check that all the data made it, just in case.

    Now, if you don’t have to worry about issues or the wiki, you are done. If you do, read on.
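One way to do that double check is to compare local branches and tags against what the remote advertises. This is only a sketch under the assumption that every local branch and tag was meant to be pushed; `unpushed_refs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# unpushed_refs is a hypothetical helper name.
# It prints every local branch or tag (as "<sha> <refname>") that the remote
# either does not have or has at a different commit. No output means the
# remote's branches and tags match the local ones.
unpushed_refs() {
    local remote_refs
    # --refs hides peeled tag entries so the lists are directly comparable.
    remote_refs=$(git ls-remote --refs --heads --tags "${1:-origin}" | awk '{print $1, $2}')
    git show-ref --heads --tags | awk '{print $1, $2}' |
        grep -Fxv -e "$remote_refs" || true
}

# Usage (inside the repository): unpushed_refs origin
```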

    Moving over wikis

    The Github wiki is just another repo associated with your main repo. So to get started, clone your old wiki repo somewhere. The next part is kind of tricky: as far as I can tell you need to click on the wiki tab of your new repo in order to create the wiki, but that seeds the newly created wiki with an initial file. So what I did, and I am not sure if there is a better way, is change the remote to the newly created wiki repo and push to the new location using

    git push --all --force
    

    The force is needed here because otherwise git will reject the push: the seeded initial commit on the new wiki is not an ancestor of the branch you are pushing. I think this may leave the initial page as an unreachable commit in the wiki repo, but the effect of that on the size of the repo should be negligible.
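Put together, the wiki move described above might look like the following sketch. `move_wiki` is a hypothetical helper name, and it assumes the new wiki has already been created through the web interface. Github wiki repositories are cloned from the main repository URL with `.wiki.git` in place of `.git`; the URLs in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# move_wiki is a hypothetical helper name.
# It clones the old wiki into a scratch directory, repoints the clone at the
# new wiki, and force-pushes so the seeded initial page is replaced.
move_wiki() {
    local old_url=$1 new_url=$2 workdir
    workdir=$(mktemp -d) &&
        git clone -q "$old_url" "$workdir/wiki" &&
        git -C "$workdir/wiki" remote set-url origin "$new_url" &&
        git -C "$workdir/wiki" push -q --all --force
}

# Usage:
#   move_wiki https://github.com/USER/OLD-REPO.wiki.git https://github.com/USER/NEW-REPO.wiki.git
```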

    Moving over issues

    There is advice on this in this answer. But the script linked in the answer looks fairly incomplete: there is a TODO for comment importing, and I couldn’t tell whether it would bring over the state of issues or not.

    So, given that I had a fairly small open issues queue and that I didn’t mind losing closed issues, I elected to bring things over by hand. Note that it is impossible to do this with proper attribution to other people on comments. So I think for a larger, more established project you would need to write a more robust script to bring everything over, but that wasn’t needed for my particular case.

    Assuming that you’ve already removed the blob from your history with git-filter-branch(1) and friends, Git often keeps things around in the reflogs, packfiles, and loose repository objects. The incantation to remove these unreferenced objects is:

    # Expire the reflogs first; git prune treats anything a reflog still points at as reachable.
    git reflog expire --expire-unreachable=now --rewrite --all
    git prune --expire=now
    git repack -a -d
    git prune-packed
    

    If you’ve done this and you still have a bigger repository than you think you should have, then something in the repository still references your blob. You’ll have to go back to the history-rewriting step and remove those references. This may help:

    # List all blobs by size in bytes.
    git rev-list --all --objects   |
        awk '{print $1}'           |
        git cat-file --batch-check |
        grep -F blob               |
        sort -k3nr
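
The listing above prints object IDs only, which makes it hard to tell which file each blob was. As a sketch of one variation, the helper below keeps the path that `git rev-list --objects` prints next to each ID by passing it through the `%(rest)` placeholder of `git cat-file --batch-check`. `largest_blobs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# largest_blobs is a hypothetical helper name.
# It prints "<size in bytes> <blob id> <path>" for the N largest blobs
# reachable from any ref (N defaults to 10), biggest first.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        sed -n 's/^blob //p' |
        sort -rn |
        head -n "${1:-10}"
}

# Usage (inside the repository): largest_blobs 5
```

If the offending file still shows up here after a rewrite, some ref still reaches it.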
    

    The script in script to find large files in git checks the .pack file, that is, the raw object store. The second script shows that the large object is no longer referenced. If you really want to clean that up, you can do a gc and repack:

    git gc --aggressive --prune=now
    git repack -A -d
    

    If this still doesn’t help, you may have a reference to the object in a remote branch. You can try the following:

    1. Find out which commit has this object (see Which commit has this blob?) and run git branch -a --contains <commit-ish>
    2. Remove the remote branch using git branch -r -D branchname
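The two steps above can be sketched in one helper. This assumes a Git new enough to support `git log --find-object` (added in Git 2.16); `refs_holding_blob` is a hypothetical helper name, and the blob ID in the usage line is a placeholder.

```shell
#!/usr/bin/env bash
# refs_holding_blob is a hypothetical helper name.
# It finds the commits that add or remove the given blob, then lists every
# local and remote-tracking branch that contains any of those commits.
refs_holding_blob() {
    git log --all --format=%H --find-object="$1" |
        while read -r commit; do
            git branch -a --contains "$commit"
        done | sort -u
}

# Usage (inside the repository): refs_holding_blob <blob-sha>
```

Any `remotes/...` entry in the output is a remote branch that is still keeping the blob alive locally.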

    Update — What is a “remote branch”?

    • A remote branch is what git fetches into when you do a git fetch / git pull. (git pull is the same as git fetch refspec + git merge remote-branch.)

    • If you cloned from a remote repository, deleting the remote branch should have no ill effect: you can always fetch/pull from the remote again using something like git fetch origin refs/heads/master:refs/remotes/origin/master (this fetches the master branch from the remote into the remote branch remotes/origin/master).

    • If this branch was created by you, deleting it should be okay too, because you should have a “normal” (tracking) branch for it. But you should double check this.

    Can someone explain what I am doing wrong or suggest an alternative method?

    Have you tried applying DMAIC? Define, Measure, Analyze, Improve, Control.

    D – My repo is still large after deleting a file from git history.
    M – Determine size of fresh repo using git init to establish baseline.
    A – Identify, validate and select root cause. Experiment with git-repo-analysis.
    I – Identify, test and implement solution. Maybe BFG Repo-Cleaner will help. Maybe it won’t.
    C – Sustain the gains. Look at something like Git LFS or other appropriate control method.

    I also want to be able to fix the remote repo on Github.

    This will depend on how you choose to resolve the problem. For example, using BFG to trim files from history rewrites history and changes commit SHAs, so there’s going to be some give and take here depending on your specific needs and desired outcomes.
