Are Git's pack files deltas rather than snapshots?

One of the key differences between Git and most other version control systems is that the others tend to store commits as a series of deltas – changesets between one commit and the next. This seems logical, since it’s the smallest possible amount of information to store about a commit. But the longer the commit history gets, the more calculation it takes to compare ranges of revisions.

By contrast, Git stores a complete snapshot of the whole project in each revision. This doesn’t make the repo size grow dramatically with each commit because each file in the project is stored as an object in the .git subdirectory, named for the hash of its contents. So if the contents haven’t changed, the hash hasn’t changed, and the commit just points to the same object. There are other optimizations as well.
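
As a rough illustration of that content addressing, here is a minimal Python sketch. It assumes the default SHA-1 object format, where a blob's name is the SHA-1 of the header `blob <size>\0` followed by the file contents; the function name `git_blob_id` is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object name a SHA-1 repository would give a blob with this content."""
    header = b"blob %d\x00" % len(content)  # object type, size in bytes, NUL separator
    return hashlib.sha1(header + content).hexdigest()


# The same contents always hash to the same name, so two commits that contain
# an unchanged file can both point at one stored object.
v1 = git_blob_id(b"hello\n")
v2 = git_blob_id(b"hello\n")
changed = git_blob_id(b"hello, world\n")

print(v1 == v2)       # identical contents share one object name
print(v1 == changed)  # different contents get a different name
```

The result can be compared against `git hash-object <file>` in a real repository.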

All this made sense to me until I stumbled on this information about pack files, into which Git puts data periodically to save space:

    In order to save that space, Git
    utilizes the packfile. This is a
    format where Git will only save the
    part that has changed in the second
    file, with a pointer to the file it is
    similar to.

Isn’t this basically going back to storing deltas? If not, how is it different? How does this avoid subjecting Git to the same problems other version control systems have?

For example, Subversion uses deltas, and rolling back 50 versions means undoing 50 diffs, whereas with Git you can just grab the appropriate snapshot. Unless Git also stores 50 diffs in the packfiles… is there some mechanism that says “after some small number of deltas, we’ll store a whole new snapshot” so that we don’t pile up too large a changeset? How else might Git avoid the disadvantages of deltas?

3 solutions collected from the web for “Are Git's pack files deltas rather than snapshots?”

    Summary:
    Git’s pack files are carefully constructed to effectively use disk caches and
    provide “nice” access patterns for common commands and for reading recently referenced
    objects.


    Git’s pack file
    format is quite flexible (see Documentation/technical/pack-format.txt,
    or The Packfile in The Git Community Book).
    The pack files store objects in two main
    ways: “undeltified” (take the raw object data and deflate-compress
    it), or “deltified” (form a delta against some other object then
    deflate-compress the resulting delta data). The objects stored in
    a pack can be in any order (they do not (necessarily) have to be
    sorted by object type, object name, or any other attribute) and
    deltified objects can be made against any other suitable object of the same type.
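
To make the two storage modes concrete, here is a toy Python sketch. It is not Git's actual delta encoding or pack format: it uses `difflib` to describe a target as "copy from base" and "insert literal" instructions, then deflate-compresses either the raw object or the serialized instructions with `zlib`. The names `make_delta` and `apply_delta` and the sample data are invented for this example.

```python
import zlib
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes already in base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # bytes that base does not have
        # a pure deletion needs no instruction: the bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from its base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d\n" % i for i in range(50))
target = base.replace(b"line 25\n", b"line twenty-five\n")
delta = make_delta(base, target)

undeltified = zlib.compress(target)                # raw object, deflated
deltified = zlib.compress(repr(delta).encode())    # delta instructions, deflated

print("undeltified:", len(undeltified), "bytes")
print("deltified:  ", len(deltified), "bytes")
print("round trip ok:", apply_delta(base, delta) == target)
```

Either way the reader gets the full object back; the deltified form just needs its base to be available first.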

    Git’s pack-objects command uses several heuristics to
    provide excellent locality of reference for common
    commands. These heuristics control both the selection of base
    objects for deltified objects and the order of the objects. Each
    mechanism is mostly independent, but they share some goals.

    Git does form long chains of delta compressed objects, but the
    heuristics try to make sure that only “old” objects are at the ends of
    the long chains. The delta base cache (whose size is controlled by the
    core.deltaBaseCacheLimit configuration variable) is automatically
    used and can greatly reduce the number of “rebuilds” required for
    commands that need to read a large number of objects (e.g. git log -p).
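
The effect of a cache on chain rebuilds can be sketched with a toy model in Python. Everything here is an assumption made for illustration: the `ToyPack` class, the simplified delta (keep a prefix of the base, then append bytes), and the byte-limited LRU cache are stand-ins, not Git's real data structures.

```python
from collections import OrderedDict


class ToyPack:
    """Objects are stored whole, or as a delta (base id, prefix to keep, bytes to append)."""

    def __init__(self, cache_limit_bytes: int):
        self.entries = {}
        self.cache = OrderedDict()        # object id -> rebuilt bytes, in LRU order
        self.cache_limit = cache_limit_bytes
        self.delta_applications = 0       # number of "rebuild" steps performed

    def add_full(self, oid, data: bytes):
        self.entries[oid] = ("full", data)

    def add_delta(self, oid, base_oid, keep: int, extra: bytes):
        self.entries[oid] = ("delta", base_oid, keep, extra)

    def read(self, oid) -> bytes:
        if oid in self.cache:             # cache hit: no rebuild needed
            self.cache.move_to_end(oid)
            return self.cache[oid]
        entry = self.entries[oid]
        if entry[0] == "full":
            data = entry[1]
        else:
            _, base_oid, keep, extra = entry
            base = self.read(base_oid)    # walk down the chain first
            self.delta_applications += 1
            data = base[:keep] + extra
        self._remember(oid, data)
        return data

    def _remember(self, oid, data: bytes):
        self.cache[oid] = data
        while self.cache and sum(len(v) for v in self.cache.values()) > self.cache_limit:
            self.cache.popitem(last=False)  # evict the least recently used entry


def rebuilds_for_history_walk(cache_limit_bytes: int) -> int:
    texts = [b"x" * (100 * (i + 1)) for i in range(5)]
    pack = ToyPack(cache_limit_bytes)
    pack.add_full(4, texts[4])            # newest version is stored whole
    for i in range(4):                    # older versions are deltas against the next newer one
        pack.add_delta(i, i + 1, keep=len(texts[i]), extra=b"")
    for oid in (4, 3, 2, 1, 0):           # read newest to oldest, like walking history
        assert pack.read(oid) == texts[oid]
    return pack.delta_applications


print("no cache:   ", rebuilds_for_history_walk(0))
print("large cache:", rebuilds_for_history_walk(10_000))
```

In this model the newest object is the chain's base and the oldest one sits at the far end, which mirrors the goal described above.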

    Delta Compression Heuristic

    A typical Git repository stores a very large number of objects, so
    it cannot reasonably compare them all to find the pairs (and
    chains) that will yield the smallest delta representations.

    The delta base selection heuristic is based on the idea that the
    good delta bases will be found among objects with similar filenames
    and sizes. Each type of object is processed separately (i.e. an
    object of one type will never be used as the delta base for an
    object of another type).

    For the purposes of delta base selection, the objects are sorted (primarily) by
    filename and then size. A window into this sorted list is used to limit
    the number of objects that are considered as potential delta bases.
    If a “good enough” [1] delta representation is not found for an object
    among the objects in its window, then the object will not be delta
    compressed.

    The size of the window is controlled by the --window= option of
    git pack-objects, or the pack.window configuration variable. The
    maximum depth of a delta chain is controlled by the --depth=
    option of git pack-objects, or the pack.depth configuration
    variable. The --aggressive option of git gc greatly enlarges
    both the window size and the maximum depth to attempt to create
    a smaller pack file.

    The filename sort clumps together the objects for entries with
    identical names (or at least similar endings (e.g. .c)). The size
    sort is from largest to smallest so that deltas that remove data are
    preferred to deltas that add data (since removal deltas have shorter
    representations) and so that the earlier, larger objects (usually
    newer) tend to be represented with plain compression.

    [1] What qualifies as “good enough” depends on the size of the object in question and its potential delta base as well as how deep its resulting delta chain would be.
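
The sort-then-window idea can be sketched in a few lines of Python. This is a simplified model, not what git pack-objects actually does: `choose_bases`, `delta_cost`, the "less than half the object's size" threshold, and the sample objects are all assumptions made for the example, and the default `window` and `max_depth` values are only illustrative.

```python
from difflib import SequenceMatcher


def delta_cost(base: bytes, target: bytes) -> int:
    """Rough delta size: bytes of target that cannot be copied from base."""
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return len(target) - copied


def choose_bases(objects, window=10, max_depth=50):
    """objects: list of (object_id, filename, data). Returns {object_id: base_id or None}."""
    # Sort by filename, then by size from largest to smallest.
    ordered = sorted(objects, key=lambda o: (o[1], -len(o[2])))
    chosen, depth = {}, {}
    for index, (oid, _name, data) in enumerate(ordered):
        best_base = None
        best_cost = len(data) // 2   # toy "good enough" threshold (an assumption)
        # Only the objects inside the sliding window are considered as bases.
        for base_id, _bname, base_data in ordered[max(0, index - window):index]:
            if depth[base_id] + 1 > max_depth:
                continue             # would make the delta chain too deep
            cost = delta_cost(base_data, data)
            if cost < best_cost:
                best_base, best_cost = base_id, cost
        chosen[oid] = best_base      # None means "store without delta compression"
        depth[oid] = 0 if best_base is None else depth[best_base] + 1
    return chosen


objects = [
    ("a1", "main.c", b"int main(void) { return 0; }\n"),
    ("a2", "main.c", b"int main(void) { puts(\"hi\"); return 0; }\n"),
    ("r1", "README", b"PROJECT NOTES: SEE THE WIKI FOR DETAILS\n"),
]
print(choose_bases(objects))
```

In this sample the larger version of main.c is stored whole and the smaller one becomes a delta against it, so the delta only has to remove data.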

    Object Ordering Heuristic

    Objects are stored in the pack files in a “most recently referenced”
    order. The objects needed to reconstruct the most recent history are
    placed earlier in the pack and they will be close together. This
    usually works well for OS disk caches.

    All the commit objects are sorted by commit date (most recent first)
    and stored together. This placement and ordering optimizes the disk
    accesses needed to walk the history graph and extract basic commit
    information (e.g. git log).

    The tree and blob objects are stored starting with the tree from the
    first stored (most recent) commit. Each tree is processed in a depth
    first fashion, storing any objects that have not already been
    stored. This puts all the trees and blobs required to reconstruct
    the most recent commit together in one place. Any trees and blobs that
    have not yet been saved but that are required for later commits are
    stored next, in the sorted commit order.

    The final object ordering is slightly affected by the delta base selection
    in that if an object is selected for delta representation and its base object
    has not been stored yet, then its base object is stored immediately before the
    deltified object itself. This prevents likely disk cache misses due to the
    non-linear access required to read a base object that would have “naturally” been
    stored later in the pack file.
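
The ordering described above can be sketched as follows in Python. The data model (tuples for commits, a dict of tree children) and the function name `pack_order` are invented for illustration, and the sketch leaves out the final adjustment that moves a delta base in front of the object that depends on it.

```python
def pack_order(commits, trees):
    """commits: list of (commit_id, date, root_tree_id).
    trees: {tree_id: [child ids]}; any id that is not a key of `trees` is a blob."""
    by_date = sorted(commits, key=lambda c: c[1], reverse=True)
    order = [cid for cid, _date, _root in by_date]   # all commits first, newest first
    seen = set(order)

    def visit(oid):
        if oid in seen:          # already stored for a more recent commit
            return
        seen.add(oid)
        order.append(oid)
        for child in trees.get(oid, []):
            visit(child)         # depth first: finish a subtree before its siblings

    for _cid, _date, root in by_date:
        visit(root)
    return order


commits = [("c1", 1, "t1"), ("c2", 2, "t2")]
trees = {
    "t1": ["readme_v1", "t_src"],
    "t2": ["readme_v2", "t_src"],
    "t_src": ["main_c"],
}
print(pack_order(commits, trees))
```

Everything needed for the newest commit (c2) comes out in one contiguous run; only the objects unique to the older commit follow.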

    The use of delta storage in the pack file is just an implementation detail. At that level, Git doesn’t know why or how something changed from one revision to the next; it just knows that blob B is pretty similar to blob A except for these changes C. So it will only store blob A and changes C (if it chooses to do so – it could also choose to store blob A and blob B).

    When retrieving objects from the pack file, the delta storage is not exposed to the caller. The caller still sees complete blobs. So, Git works the same way it always has without the delta storage optimisation.

    As I mentioned in “What are git’s thin packs?”:

    Git does deltification only in packfiles

    I detailed the delta encoding used for pack files in “Is the git binary diff algorithm (delta storage) standardized?”

    Note that the default value of the core.deltaBaseCacheLimit config, which controls the size of the delta base cache, will soon be bumped from 16MB to 96MB, for Git 2.0.x/2.1 (Q3 2014).

    See commit 4874f54 by David Kastrup (May 2014):

    Bump core.deltaBaseCacheLimit to 96m

    The default of 16m causes serious thrashing for large delta chains combined with large files.

    Here are some benchmarks (pu variant of git blame):

    time git blame -C src/xdisp.c >/dev/null
    

    for a repository of Emacs repacked with git gc --aggressive (v1.9, resulting in a window size of 250) located on an SSD drive.
    The file in question has about 30000 lines, 1Mb of size, and a history with about
    2500 commits.

    16m (previous default):
      real  3m33.936s
      user  2m15.396s
      sys   1m17.352s
    
    96m:
      real  2m5.668s
      user  1m50.784s
      sys   0m14.288s
    