Impact of large number of branches in a git repo?
Does anyone know what the impact is of a git repo that has a lot of branches (2000+)? Does git pull or git fetch slow down due to having that many branches? Please provide benchmarks if there is a difference.
As others have pointed out, branches and other refs are just files in the file system (except that’s not quite true because of packed refs) and are pretty cheap, but that doesn’t mean their number can’t affect performance. See e.g. the Poor push performance with large number of refs thread on the Git mailing list for a recent (Dec 2014) example of Git performance being affected by having 20k refs in a repository.
If I recall correctly, some part of the ref processing was O(n²) a few years ago but that can very well have been fixed since. There’s a repo-discuss thread from March 2012 that contains some potentially useful details, if perhaps dated and specific to JGit.
The also somewhat dated Scaling Gerrit article talks about (among other things) potential problems with high ref counts, but also notes that several sites have gits with over 100k refs. We have a git with ~150k refs and I don’t think we’re seeing any performance issues with it.
One aspect of having lots of refs is the size of the ref advertisement at the start of some Git transactions. The advertisement for the aforementioned 150k-ref git is about 10 MB, i.e. every single git fetch operation is going to download that amount of data.
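To get a feel for where a figure like 10 MB comes from, here is a rough back-of-the-envelope sketch. It assumes the original (non-v2) wire format, where each advertised ref is one pkt-line made of a 4-character length prefix, a 40-character hex object ID, a space, the ref name and a newline. The ref names are hypothetical, and the capability list on the first line and the final flush packet are ignored.

```python
def pkt_line_size(oid_hex_len: int, refname: str) -> int:
    """Bytes for one advertised ref: 4-byte length prefix + '<oid> <refname>\\n'."""
    return 4 + oid_hex_len + 1 + len(refname.encode("utf-8")) + 1


def advertisement_size(refnames, oid_hex_len: int = 40) -> int:
    """Approximate total size, ignoring capabilities and the flush packet."""
    return sum(pkt_line_size(oid_hex_len, name) for name in refnames)


# Hypothetical ref names, chosen only to get a plausible average length.
refs = [f"refs/heads/feature/ticket-{i:06d}" for i in range(150_000)]
total = advertisement_size(refs)
print(f"{len(refs)} refs -> about {total / 1e6:.1f} MB")
# prints: 150000 refs -> about 11.7 MB
```

With these made-up names each ref costs 78 bytes, which is in the same ballpark as the ~10 MB quoted above for 150k refs. The same arithmetic for 2000 refs gives roughly 0.16 MB.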
So yes, don’t ignore the issue completely but you shouldn’t lose any sleep over a mere 2000 refs.
I don’t have benchmarks, but one way to ensure that git fetch remains reasonable even if the upstream repo has a large set of branches would be to specify a less general refspec than the default one:
fetch = +refs/heads/*:refs/remotes/origin/*
You can add as many fetch refspecs to a remote as you want, effectively replacing the catch-all refspec above with more specific ones that include only the branches you actually need (even if the remote repo has thousands of them):
fetch = +refs/heads/master:refs/remotes/origin/master
fetch = +refs/heads/br*:refs/remotes/origin/br*
fetch = +refs/heads/mybranch:refs/remotes/origin/mybranch
....
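To see which remote branches a set of refspecs like this would pick up, the following is a simplified model of refspec matching, not Git's actual implementation. It assumes at most one `*` per side, that the `*` stands for any run of characters, and that the leading `+` (force update) does not affect matching. The helper names and the sample remote refs are hypothetical.

```python
def map_ref(refspec: str, remote_ref: str):
    """Return the local tracking ref for remote_ref, or None if it does not match."""
    spec = refspec[1:] if refspec.startswith("+") else refspec
    src, dst = spec.split(":", 1)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # The part matched by '*' is substituted into the destination side.
    middle = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", middle, 1)


def select_refs(refspecs, remote_refs):
    """Map every remote ref matched by at least one refspec to its local name."""
    selected = {}
    for ref in remote_refs:
        for spec in refspecs:
            local = map_ref(spec, ref)
            if local is not None:
                selected[ref] = local
                break
    return selected


refspecs = [
    "+refs/heads/master:refs/remotes/origin/master",
    "+refs/heads/br*:refs/remotes/origin/br*",
    "+refs/heads/mybranch:refs/remotes/origin/mybranch",
]
remote_refs = [
    "refs/heads/master",
    "refs/heads/branch-a",
    "refs/heads/feature/x",
    "refs/heads/mybranch",
    "refs/tags/v1.0",
]
for remote, local in select_refs(refspecs, remote_refs).items():
    print(remote, "->", local)
```

In this sample, `refs/heads/feature/x` and `refs/tags/v1.0` match none of the three refspecs, so they would not get a remote-tracking ref.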
Yes, it does. Locally, it’s not much of a problem, though it does still affect several local commands, in particular those that try to describe a commit based on the available refs.
Over the network, Git does an initial ref advertisement when you connect to it for updates. You can learn about this in the pack protocol document. The problem here is that your network connection may be flaky or have high latency, and that initial advertisement can take a while as a result. There have been discussions about removing this requirement, but, as always, compatibility issues make it complicated. The most recent discussion about it is here.
You probably want to look at a recent discussion about Git scaling too. There are many ways in which you may want Git to scale, and the discussion has covered the majority of them so far. I think it gives you a good idea of what Git is good at and where it could use some work. I’d summarize it for you, but I don’t think I could do it justice. There’s a lot of useful information there.
In order to answer your question, you should know how Git handles branches. What are branches?
A branch is only a reference to a commit in the local repo, so creating branches is very cheap.
The .git directory contains directories that hold the metadata Git uses. When you create a branch, a reference to the local branch is created along with a history log (the reflog). In other words, creating branches means creating files and references, and the system can easily handle 2000 files.
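To make the "branches are just files" point concrete, the sketch below mimics the layout of loose refs in a throwaway directory: one small file per branch under refs/heads, holding a 40-character object ID and a newline. It does not create a real repository. The object ID is a placeholder, the helper name is made up, and real Git additionally writes reflogs and may consolidate refs into a single packed-refs file.

```python
import tempfile
from pathlib import Path


def create_loose_branch(git_dir: Path, name: str, oid: str) -> Path:
    """Write a loose-ref style file: refs/heads/<name> containing '<oid>\\n'."""
    ref_path = git_dir / "refs" / "heads" / name
    ref_path.parent.mkdir(parents=True, exist_ok=True)
    ref_path.write_bytes((oid + "\n").encode("ascii"))
    return ref_path


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"  # stand-in directory, not a real repository
    fake_oid = "0123456789abcdef0123456789abcdef01234567"  # placeholder object ID
    for i in range(2000):
        create_loose_branch(git_dir, f"branch-{i}", fake_oid)
    files = list((git_dir / "refs" / "heads").iterdir())
    total_bytes = sum(f.stat().st_size for f in files)

print(len(files), "files,", total_bytes, "bytes of ref data")
# prints: 2000 files, 82000 bytes of ref data
```

At 41 bytes per file, 2000 branches amount to about 82 kB of ref data, not counting filesystem overhead.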
I advise you to go through 3.1 Git Branching – Branches in a Nutshell; it contains information that might help you better understand how branches are handled.