Does git de-duplicate between files?
If my repository contains several copies of the same files with only small changes (don’t ask why), will git save space by only storing the differences between the files?
2 solutions collected from the web for “Does git de-duplicate between files?”
It could, but it is very hard to say whether it will. There are situations where it is guaranteed that it won’t.
To understand this answer (and its limitations) we must look at the way git stores objects. There’s a good description of the format of “git objects” (as stored in .git/objects/) in this Stack Overflow answer or in the Pro Git book.
When storing “loose objects” like this—which git does for what we might call “active” objects—they are zlib-deflated, as the Pro Git book says, but not otherwise compressed. So two different (not bit-for-bit identical) files stored in two different objects are never compressed against each other.
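To see the loose-object behaviour for yourself, here is a minimal sketch, assuming a POSIX shell with git, seq and awk available. The scratch directory and the file names a.txt, b.txt and c.txt are made up for illustration. It shows that a bit-for-bit copy shares one object with its original, while a near-copy gets a full object of its own.

```shell
set -e
# Scratch repository in a temporary directory (illustrative location).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# a.txt: 2000 lines. b.txt: the same lines plus one extra line at the end.
seq 1 2000 > a.txt
cp a.txt b.txt
echo "one extra line" >> b.txt
# c.txt: a bit-for-bit copy of a.txt.
cp a.txt c.txt

git add a.txt b.txt c.txt

# ":path" names the blob recorded in the index for that path.
id_a=$(git rev-parse :a.txt)
id_b=$(git rev-parse :b.txt)
id_c=$(git rev-parse :c.txt)

# "count:" in this output is the number of loose objects.
loose=$(git count-objects -v | awk '$1 == "count:" {print $2}')
echo "a=$id_a"
echo "b=$id_b"
echo "c=$id_c"
echo "loose objects: $loose"
```

Three files were added, but only two loose objects exist: a.txt and c.txt hash to the same ID, and b.txt is stored as a separate, complete (zlib-deflated) object.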
On the other hand, eventually objects can be “packed” into a “pack file”. See another section of the Pro Git book for information on pack files. Objects stored in pack files are “delta-compressed” against other objects in the same file. Precisely what criteria git uses for choosing which objects are compressed against which other objects is quite obscure. Here’s a snippet from the Pro Git Book again:
When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The git verify-pack plumbing command allows you to see what was packed up […]
If git decides to delta-compress “pack entry for big file A” vs “pack entry for big file B”, then—and only then—can git save space in the way you asked.
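One way to check whether git actually made that decision is to pack a small test repository and read the git verify-pack output. This is only a sketch: the file names, the commit identity and the scratch directory are placeholders, and whether a delta appears in a real repository depends on git's heuristics. In the -v listing, a deltified entry carries two extra columns (delta depth and base object ID), so it has seven fields instead of five.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Two near-identical text files (illustrative names).
seq 1 2000 > a.txt
cp a.txt b.txt
echo "one extra line" >> b.txt
git add a.txt b.txt
# Placeholder identity so the commit works without any global config.
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "two near-identical files"

# Pack everything, then list what ended up in the pack.
git gc -q
listing=$(git verify-pack -v .git/objects/pack/pack-*.idx)
printf '%s\n' "$listing"

# Deltified entries have 7 fields: id, type, size, size-in-pack, offset,
# depth, base id. Non-delta entries have only the first 5.
deltas=$(printf '%s\n' "$listing" | awk 'NF == 7' | wc -l | tr -d ' ')
echo "deltified entries: $deltas"
```

If one of the two blobs shows up with a depth and a base ID pointing at the other blob, git stored it as a delta, which is the space saving the question asks about.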
Git makes pack files every time git gc runs (or more precisely, through git pack-objects and git repack; higher-level operations, including git gc, run these for you when needed/appropriate). At this time, git gathers up loose objects, and/or explodes and re-packs existing packs. If your close-but-not-quite-identical files get delta-compressed against each other at this point, you may see some very large space-savings.
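The effect of packing on disk usage can be measured with git count-objects before and after git gc. The sketch below is an experiment under assumptions, not a guarantee: it uses two made-up files of random bytes (which zlib alone can barely shrink) so that any saving must come from delta compression, and the helper name field is invented here.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# ~200 KB of random data, plus a copy with a few bytes appended.
head -c 200000 /dev/urandom > a.bin
cp a.bin b.bin
echo "small change" >> b.bin
git add a.bin b.bin
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "two near-identical files"

# Hypothetical helper: print one field of `git count-objects -v`.
field() { git count-objects -v | awk -v key="$1:" '$1 == key {print $2}'; }

loose_kib_before=$(field size)       # KiB used by loose objects
git gc -q
loose_count_after=$(field count)     # loose objects left after packing
pack_kib_after=$(field size-pack)    # KiB used by packs

echo "loose before gc: ${loose_kib_before} KiB"
echo "loose objects after gc: ${loose_count_after}"
echo "pack after gc: ${pack_kib_after} KiB"
```

With these inputs the pack should come out at roughly half the loose size, because the second file is stored as a small delta against the first.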
If you then go to modify the files, though, you’ll work on the expanded and uncompressed versions in your work tree and then git add the result. This will make a new “loose object”, and by definition that won’t be delta-compressed against anything (no other loose object, nor any pack).
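This is easy to observe in a throwaway repository. The following sketch (file names and commit identity are placeholders) packs everything, then edits one file and stages it; the new version reappears as a loose object even though its predecessor sits in the pack.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

head -c 200000 /dev/urandom > a.bin
cp a.bin b.bin
echo "small change" >> b.bin
git add a.bin b.bin
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "two near-identical files"

# Pack everything: no loose objects should remain.
git gc -q
loose_after_gc=$(git count-objects -v | awk '$1 == "count:" {print $2}')

# Edit one file and stage it: the new content is written as a loose object.
echo "another small change" >> b.bin
git add b.bin
loose_after_add=$(git count-objects -v | awk '$1 == "count:" {print $2}')
loose_kib_after_add=$(git count-objects -v | awk '$1 == "size:" {print $2}')

echo "loose after gc:  $loose_after_gc"
echo "loose after add: $loose_after_add (${loose_kib_after_add} KiB)"
```

The single new loose object takes about as much space as the whole file, until a later repack gets a chance to delta-compress it again.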
When you clone a repository, generally git makes packs (or even “thin packs”, which are packs that are not stand-alone) out of the objects to be transferred, so that what is sent across the Intertubes is as small as possible. So here you may get the benefit of delta compression even if the objects are loose in the source repository. Again, you’ll lose the benefit as soon as you start working on those files (turning them into loose objects), and regain it only if-and-when the loose objects are packed again and git’s heuristics compress them against each other.
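A local experiment can imitate this, assuming the --no-local option is used so that git goes through its normal pack-based transport instead of copying or hardlinking object files. The directory names src and dst and the commit identity below are placeholders.

```shell
set -e
work=$(mktemp -d)
src="$work/src"
dst="$work/dst"

# Source repository whose objects are all loose (never packed).
git init -q "$src"
cd "$src"
head -c 200000 /dev/urandom > a.bin
cp a.bin b.bin
echo "small change" >> b.bin
git add a.bin b.bin
git -c user.name=Example -c user.email=example@example.invalid \
    commit -q -m "two near-identical files"
src_loose=$(git count-objects -v | awk '$1 == "count:" {print $2}')
src_kib=$(git count-objects -v | awk '$1 == "size:" {print $2}')

# Clone through the regular transport; objects travel as a pack.
cd "$work"
git clone -q --no-local "$src" "$dst"
cd "$dst"
dst_loose=$(git count-objects -v | awk '$1 == "count:" {print $2}')
dst_in_pack=$(git count-objects -v | awk '$1 == "in-pack:" {print $2}')
dst_kib=$(git count-objects -v | awk '$1 == "size-pack:" {print $2}')

echo "source: $src_loose loose objects, ${src_kib} KiB"
echo "clone:  $dst_loose loose, $dst_in_pack packed, ${dst_kib} KiB"
```

The source keeps its loose objects, while the clone holds the same objects in a pack, and that pack is typically much smaller if git delta-compressed the two files while building it.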
The real takeaway here is that to find out, you can simply try it, using the method outlined in the Pro Git book.
will git save space by only storing the differences between the files?
Yes, git can pack the files into a compressed format.
You have two nearly identical 4K objects on your disk. Wouldn’t it be
nice if Git could store one of them in full but then the second object
only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves
objects on disk is called a loose object format. However, occasionally
Git packs up several of these objects into a single binary file called
a packfile in order to save space and be more efficient. Git does this
if you have too many loose objects around, if you run the git gc
command manually, or if you push to a remote server. To see what
happens, you can manually ask Git to pack up the objects by calling