How do different version control systems handle binary files?
I have heard some claims that SVN handles binary files better than Git/Mercurial. Is this true and if so then why? As far as I can imagine, no version control system (VCS) can diff and merge changes between two revisions of the same binary resources.
So, aren’t all VCSs bad at handling binary files? I am not very familiar with the technical details behind particular VCS implementations, so maybe each has its own pros and cons.
5 solutions collected from the web for “How do different version control systems handle binary files?”
The main pain point is the “Distributed” aspect of any DVCS: you are cloning everything (the entire history of all files).
Since, for most of them, binaries aren’t stored as deltas and don’t compress as well as text files, storing rapidly evolving binaries quickly leaves you with a large repository that is cumbersome to move around (push/pull).
For Git, for instance, see “What are the git limits?”.
Binaries aren’t a good fit for the features a VCS can bring (diff, branch, merge), and are better managed in an artifact repository (like Nexus, for example).
This is not necessarily the case for a CVCS (Centralized VCS), where the repository could play that role and serve as storage for binaries (even if that is not its primary role).
One clarification about git and binary files.
Git compresses binary files as well as text files. So Git is not crap at handling binary files, as someone suggested.
Any file that Git adds will be compressed into a loose object. It doesn’t matter whether it is binary or text. If you have a binary or text file and you commit it, the repository will grow. If you make a minor change to the file and commit again, your repository will grow again by approximately the same amount, depending on the compression ratio.
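To make the loose-object step concrete, here is a minimal Python sketch of how a blob is built, assuming a default SHA-1 repository: the content is prefixed with a "blob <size>" header and a NUL byte, hashed, and deflated with zlib. The helper name loose_object and the sample data are made up for illustration; this is not Git's own code.

```python
import hashlib
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, bytes]:
    """Return (object id, zlib-compressed bytes) for content stored as a blob."""
    # A blob is hashed and stored as b"blob <size>\0" followed by the raw bytes.
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    return object_id, zlib.compress(store)


text = b"hello\n"
binary = bytes(range(256)) * 4  # hypothetical stand-in for binary data

# The same code path handles both; nothing checks whether the bytes are text.
for name, data in (("text", text), ("binary", binary)):
    object_id, compressed = loose_object(data)
    print(name, object_id, len(data), "->", len(compressed), "bytes")
```

The id printed for the text sample should match what git hash-object reports for a file containing the same bytes. Each version of a file gets its own object like this, which is why two commits of a slightly changed file roughly double the space used until the objects are packed.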
Then you run git gc. Git will find similarities in the binary or text files and compress them together. You will get good compression if the similarities are large.
If, on the other hand there are no similarities between the files, you will not have much of a gain compressing them together compared to compressing them individually.
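The effect of compressing similar versus dissimilar files together can be imitated with plain zlib. This is only an analogy under stated assumptions: Git's pack files use their own delta encoding between objects, which this sketch does not reproduce. The blobs are kept at 16 KiB because zlib only looks back 32 KiB, and the random data and helper names are hypothetical.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable


def random_blob(size: int) -> bytes:
    """Return pseudo-random bytes, which zlib cannot shrink on their own."""
    return bytes(rng.getrandbits(8) for _ in range(size))


def sizes(a: bytes, b: bytes):
    """Return (compressed separately, compressed together) sizes in bytes."""
    separately = len(zlib.compress(a)) + len(zlib.compress(b))
    together = len(zlib.compress(a + b))
    return separately, together


original = random_blob(16 * 1024)
changed = bytearray(original)
changed[100:104] = b"\x00\x00\x00\x00"  # "change a few pixels"
edited = bytes(changed)
unrelated = random_blob(16 * 1024)

similar_sizes = sizes(original, edited)
unrelated_sizes = sizes(original, unrelated)
print("similar   (separately, together):", similar_sizes)
print("unrelated (separately, together):", unrelated_sizes)
```

For the nearly identical pair, the combined size should come out close to the size of one copy. For the unrelated pair, compressing together gains almost nothing.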
Here is a test with a bit-mapped picture (binary) that I changed a little:
martin@martin-laptop:~/testing123$ git init
Initialized empty Git repository in /home/martin/testing123/.git/
martin@martin-laptop:~/testing123$ ls -l
total 1252
-rw------- 1 martin martin 1279322 Jan 8 22:42 pic.bmp
martin@martin-laptop:~/testing123$ git add .
martin@martin-laptop:~/testing123$ git commit -a -m first
[master (root-commit) 53886cf] first
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 pic.bmp
// here is the size:
martin@martin-laptop:~/testing123$ du -s .git
1244 .git
// Changed a few pixels in the picture
martin@martin-laptop:~/testing123$ git add .
martin@martin-laptop:~/testing123$ git commit -a -m second
[master da025e1] second
 1 files changed, 0 insertions(+), 0 deletions(-)
// here is the size:
martin@martin-laptop:~/testing123$ du -s .git
2364 .git
// As you can see the repo is twice as large
// Now we run git gc to compress
martin@martin-laptop:~/testing123$ git gc
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)
// here is the size after compression:
martin@martin-laptop:~/testing123$ du -s .git
1236 .git
// we are back to a smaller size than ever...
Git and Mercurial both handle binary files with aplomb. They don’t corrupt them, and you can check them in and out. The problem is one of size.
Source usually takes up less room than binary files. You might have 100 KB of source files that build a 100 MB binary. Thus, storing a single build in my repository could cause it to grow to roughly a thousand times its size.
And it’s even worse:
Version control systems usually store files via some form of diff format. Let’s say I have a file of 100 lines and each line averages about 40 characters. That entire file is 4K in size. If I change a line in that file, and save that change, I’m only adding about 60 bytes to the size of my repository.
Now, let’s say I compiled and added that 100 MB file. I make a change in my source (maybe 10 KB or so in changes), recompile, and store the new binary build. Well, binaries don’t usually diff very well, so it’s very likely I’m adding another 100 MB to my repository. Do a few builds, and my repository grows to several gigabytes, yet the source portion of my repository is still only a few dozen kilobytes.
The problem with Git and Mercurial is that you normally check out the entire repository onto your system. Instead of merely downloading a few dozen kilobytes that can be transferred in a few seconds, I am now downloading several gigabytes of builds along with the few dozen kilobytes of data.
Maybe people say Subversion is better since I can simply check out the version I want in Subversion and not download the whole repository. However, Subversion doesn’t give you an easy way to remove obsolete binaries from your repository, so your repository will grow and grow anyway. I still don’t recommend it. Heck, I don’t recommend it even if the revision control system does allow you to remove old revisions of obsolete binaries (Perforce, ClearCase, and CVS all do). It just ends up being a big maintenance headache.
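The arithmetic can be sketched as a rough model. It assumes the figures used above (100 KB of source, about 10 KB of source changes per build, a 100 MB binary) and, as a deliberate simplification, that a rebuilt binary shares nothing with the previous one. The count of 30 builds and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB

SOURCE_SIZE = 100 * KB   # source tree
SOURCE_DELTA = 10 * KB   # source changes stored per additional build
BINARY_SIZE = 100 * MB   # compiled binary, assumed not to delta at all


def repo_size(builds: int, store_binaries: bool) -> int:
    """Estimate the full history size after a number of builds."""
    size = SOURCE_SIZE + SOURCE_DELTA * max(builds - 1, 0)
    if store_binaries:
        size += BINARY_SIZE * builds
    return size


def checkout_size(store_binaries: bool) -> int:
    """Estimate the size of one revision, without any history."""
    return SOURCE_SIZE + (BINARY_SIZE if store_binaries else 0)


builds = 30  # hypothetical number of committed builds
print("history, source only:   %.1f KB" % (repo_size(builds, False) / KB))
print("history, with binaries: %.1f MB" % (repo_size(builds, True) / MB))
print("single revision:        %.1f MB" % (checkout_size(True) / MB))
```

A clone of a distributed repository transfers something like the full history figure, while a checkout of a single revision from a centralized server transfers something like the last figure. The server still has to hold the full history either way.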
Now, this isn’t to say you shouldn’t store any binary files. For example, if I am making a web page, I probably have some gifs and jpegs that I need. No problem storing those in either Subversion or Git/Mercurial. They’re relatively small, and probably change a lot less than my code itself.
What you shouldn’t store are built objects. These should be stored in a release repository and fetched as needed. Maven and Ant with Ivy do a great job of this. And you can use the Maven repository structure in C, C++, and C# projects too.
In Subversion you can lock binary files to make sure that nobody else can edit them. This mostly assures you that nobody else will modify that binary file while you have it locked. Distributed VCSs don’t (and can’t) have locks, because there’s no central repository for them to be registered at.
Text files have a natural line-oriented structure that binary files lack. This is why it’s harder to compare them using common text tools (diff). While it should be possible, the advantage of readability (the reason we use text as our preferred format in the first place) would be lost when applying diffs to binary files.
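A small Python sketch can illustrate the contrast. It uses difflib from the standard library for the text side and a naive byte-by-byte comparison for the binary side. The file names, sample contents, and the changed_offsets helper are made up for illustration.

```python
import difflib

old_text = "alpha\nbeta\ngamma\n".splitlines(keepends=True)
new_text = "alpha\nBETA\ngamma\n".splitlines(keepends=True)

# Line-oriented diff: the result names the exact line that changed.
text_diff = list(difflib.unified_diff(old_text, new_text, "old.txt", "new.txt"))
print("".join(text_diff), end="")


def changed_offsets(old: bytes, new: bytes):
    """Return the offsets at which two equal-length byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(old, new)) if x != y]


old_bin = bytes([0x42, 0x4D, 0x10, 0x20, 0x30, 0x40])  # made-up bytes
new_bin = bytes([0x42, 0x4D, 0x10, 0x21, 0x30, 0x41])

# Byte-oriented comparison: possible, but the result is only a list of offsets.
offsets = changed_offsets(old_bin, new_bin)
print("binary offsets that changed:", offsets)
```

The text diff can be read and reviewed directly. The list of byte offsets is just as precise, but it says nothing a person can interpret without knowing the file format.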
As to your suggestion that all version control systems “are crap at handling binary files”, I don’t know. In principle, there’s no reason why a binary file should be slower to process. I would rather say that the advantages of using a VCS (tracking, diffs, overview) are more apparent when handling text files.