git very slow with many ignored files

I have set up a repository whose working directory contains many tens of thousands of files in thousands of directories, amounting to many GB of data. This directory is located on a Samba share. I only want to have a few dozen source files within this directory under version control.

I have set up the .gitignore file as follows, and it works:

# Ignore everything
*

# Except a couple of files in any directory
!*.pin
!*.bsh
!*/

Operations on the repository (such as commit) take several minutes to carry out. This is too long to reasonably get any work done. I suspect that the slowdown is because git is trawling through every directory looking for files that may have been updated.
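
A quick way to check where the time goes (assuming a git new enough to support the GIT_TRACE_PERFORMANCE environment variable) is to time a no-op status run:

# Time a status run and ask git to report per-phase timings on stderr
time git status
GIT_TRACE_PERFORMANCE=1 git status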

There are only a few locations in the working directory where I have files that I want to track, so I tried to narrow down the set of files to examine with these patterns:

*
!/version_2/analysis/abcd.pin
!/version_2/analysis/*.bsh
!*/

This also works, but it is still just as slow as the less specific .gitignore. I’m guessing that final line is the killer, but no matter how I tried to make the unignore patterns very specific, I always had to include that final wildcard clause in order for the process to find any files to commit.
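
As an aside, git can report which rule matched (or unignored) a given path, which is handy for debugging patterns like these (using the abcd.pin path from above; check-ignore has been available since git 1.8.2):

git check-ignore -v version_2/analysis/abcd.pin
# prints <gitignore file>:<line>:<pattern> followed by the path;
# negation (!) patterns are shown too, and no output means no pattern matched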

So my two-part question is:

1) Is there a better way to set up the .gitignore file that will speed up the commit process by only including the very narrow set of directories and file types that contain relevant results?

2) Are there other tweaks to git or Samba required to make this work more efficiently?

Thanks,

Tom

2 Solutions collected from the web for “git very slow with many ignored files”

    After fiddling around for a bit, I have found a way to significantly improve performance by just modifying the .gitignore file.

    The performance problem was caused by my approach of ignoring everything and then specifying what to unignore. This had a nice, concise specification (4 lines), but was really slow: it caused git to walk the entire directory tree in order to detect what had changed.

    My new and improved approach is to use exclude patterns only. With these I can indicate large branches of the tree to prune. I had to add a lengthier set of directories and file types to exclude, and this took a few iterations to get right because there were so many. Given the nature of the data sets, there may be more maintenance of the .gitignore file required in future if new file types show up, but this is a small price to pay.

    Here is something like what my final .gitignore file looks like:

    # Prune large input data and results folders wherever they occur
    data/
    results/
    
    # Exclude document types that don't need versioning,
    # leaving only the types of interest
    *~
    *#
    *.csv
    *.doc
    *.docx
    *.gif
    *.htm
    *.html
    *.ini
    *.jpg
    *.odt
    *.pdf
    *.png
    *.ppt
    *.pptx
    *.xls
    *.xlsx
    *.xlsm
    *.xml
    *.rar
    *.zip
    

    Commit times are now down to a few seconds.

    Overall this is still pretty simple, although not as clean as my initial 4-liner.
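
    A quick sanity check after a .gitignore rewrite like this (just standard git commands, nothing specific to my setup):

    # List the files git is actually tracking
    git ls-files
    # List untracked files that are NOT ignored, i.e. anything that slipped through
    git status --short --untracked-files=all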

    After review, I think my problem was that I became a victim of my own premature optimization.

    There’s not a lot you can do about this, unfortunately; at least, not without restructuring your repo. Your supposition is correct: because you have a very large working tree with lots of individual files, git is going to trawl through them all. And no, tweaking your .gitignore won’t help: internally, as far as I know, git still follows each folder path, and only ignores files (not folders) that match the patterns specified in the .gitignore.

    And, quite naturally, this is made substantially worse by the fact that this is on a network share, meaning that every trip back and forth to the file system (of which many are made for just about any “standard” git operation) is done at the speed of network latency (even a few ms per file adds up over many thousands of files).
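
    To put rough, purely illustrative numbers on that: 50,000 files at 2 ms per round trip is already 100 seconds just to stat the tree. If you happen to be using Git for Windows, its filesystem cache may soften (though not eliminate) this:

    # Git for Windows only: cache file metadata in memory for the
    # duration of a command, cutting repeated network round trips
    git config core.fscache true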

    I don’t believe file size is the issue here, unfortunately, so the suggestion given in the comments (symlinking) likely won’t give you any speedup, because your slowdown factor seems to be the number of files rather than their size.

    What you could do is move all of the untracked files outside of the repo – if they make up the bulk of the number of files, it should provide you with a substantial speedup. This may not necessarily be possible, but it’s about the only thing I can think of short of moving the repo to your local machine (which may not necessarily be possible either).
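
    A sketch of what that move could look like (hypothetical paths; unlike symlinking individual large files, relocating whole directories removes their contents from git’s scan, since git treats a symlink as a single entry and does not descend into it):

    # Move the bulky untracked data out of the working tree...
    mv /srv/project/data /srv/project_data
    # ...and link it back so existing tools still find it
    ln -s /srv/project_data /srv/project/data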
