What's the difference between `git diff --patience` and `git diff --histogram`?
This earlier question asked for the differences between 4 different Git diff strategies, but the only difference that was actually explained was the one involving patience, which is pretty well explained elsewhere.
How does the histogram strategy work? What differentiates it from patience? The git-diff man page only says that it "extends the patience algorithm to 'support low-occurrence common elements'". Other pages mention that it's faster, and that it comes from JGit, but they don't explain where or how its algorithm or results will differ from patience.
Where can I find a description of the histogram algorithm relative to the patience algorithm, with the same level of detail as Bram Cohen's original description of the patience algorithm?
(If it's just a matter of implementation performance, with no case that will produce different results, why wasn't it just implemented as a new backend for --patience?)
Answer:
The histogram strategy was introduced in git 1.7.7 (September 2011), with the following description (as mentioned by the OP):

"`git diff` learned a `--histogram` option to use a different diff generation machinery stolen from jgit, which might give better performance."
The description there is fairly complete:
An extended form of Bram Cohen’s patience diff algorithm.
This implementation was derived by using the 4 rules that are outlined in
Bram Cohen’s blog, and then was further extended to support low-occurrence common elements.
The basic idea of the algorithm is to create a histogram of occurrences for each element of sequence A. Each element of sequence B is then considered in turn. If the element also exists in sequence A, and has a lower occurrence count, the positions are considered as a candidate for the longest common subsequence (LCS).
After scanning of B is complete the LCS that has the lowest number of occurrences is chosen as a split point. The region is split around the LCS, and the algorithm is recursively applied to the sections before and after the LCS.
By always selecting a LCS position with the lowest occurrence count, this algorithm behaves exactly like Bram Cohen’s patience diff whenever there is a unique common element available between the two sequences.
When no unique elements exist, the lowest occurrence element is chosen instead.
This offers more readable diffs than simply falling back on the standard Myers' O(ND) algorithm would produce.
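The quoted description can be sketched in a few lines. This is a simplified illustration of the recursive split, not the real JGit/xdiff implementation: it anchors on a single lowest-occurrence common line rather than extending the anchor to the longest matching run, and the function name and output format are invented for the example:

```python
from collections import Counter

def histogram_diff(a, b):
    """Sketch of histogram diff: recursively split both sequences around
    a common element with the lowest occurrence count in A (a unique
    element when one exists, matching patience diff's behavior).

    Returns a list of (tag, a_lines, b_lines) chunks,
    where tag is 'equal' or 'replace'.
    """
    if not a or not b:
        return [('replace', a, b)] if (a or b) else []
    counts = Counter(a)           # histogram of occurrences in A
    best = None                   # (occurrence count, index in a, index in b)
    for j, line in enumerate(b):  # consider each element of B in turn
        if line in counts and (best is None or counts[line] < best[0]):
            best = (counts[line], a.index(line), j)
    if best is None:              # no common element at all
        return [('replace', a, b)]
    _, i, j = best                # split around the chosen anchor, recurse
    return (histogram_diff(a[:i], b[:j])
            + [('equal', [a[i]], [b[j]])]
            + histogram_diff(a[i + 1:], b[j + 1:]))
```

For example, `histogram_diff(['a', 'b', 'c'], ['a', 'x', 'c'])` anchors on the unique lines `a` and `c` and reports `b` → `x` as the only replacement.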
To prevent the algorithm from having an O(N^2) running time, an upper limit on the number of unique elements in a histogram bucket is configured by `#setMaxChainLength(int)`. If sequence A has more than this many elements that hash into the same hash bucket, the algorithm passes the region to the configured fallback algorithm. If no fallback algorithm is configured, the region is emitted as a replace edit.
During scanning of sequence B, any element of A that occurs more than `#setMaxChainLength(int)` times is never considered for an LCS match position, even if it is common between the two sequences. This limits the number of locations in sequence A that must be considered to find the LCS, and helps maintain a lower running time bound.
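That cutoff can be illustrated with a small helper. The names here (`MAX_CHAIN_LENGTH`, `candidate_anchors`) are hypothetical, chosen to mirror the small constant JGit configures via `setMaxChainLength`:

```python
from collections import Counter

MAX_CHAIN_LENGTH = 64  # analogous to JGit's small constant

def candidate_anchors(a, b):
    """Yield (count, index_in_a, index_in_b) split candidates.

    Elements of A occurring more than MAX_CHAIN_LENGTH times are never
    considered as LCS match positions, even if they are common to both
    sequences -- this is what keeps the running time bounded.
    """
    counts = Counter(a)
    first_pos = {}
    for i, line in enumerate(a):
        first_pos.setdefault(line, i)    # first occurrence in A
    for j, line in enumerate(b):
        c = counts.get(line, 0)
        if 0 < c <= MAX_CHAIN_LENGTH:    # skip over-common elements
            yield (c, first_pos[line], j)
```

With `a = ['{'] * 100 + ['main()']` and `b = ['{', 'main()']`, the 100 copies of `{` exceed the cutoff and are skipped, so only `main()` is offered as an anchor.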
So long as `#setMaxChainLength(int)` is a small constant (such as 64), the algorithm runs in O(N * D) time, where N is the sum of the input lengths and D is the number of edits in the resulting edit list. If the supplied `SequenceComparator` has a good hash function, this implementation typically out-performs `MyersDiff`, even though its theoretical running time is the same.
This implementation has an internal limitation that prevents it from handling sequences with more than 268,435,456 (2^28) elements.
Note that this kind of algorithm was already used for pack_check, back in 2006 (git 1.3), for `git-verify-pack -v`. It was reused for index-pack in git 1.7.7.
Commit 8c912ee actually introduced `--histogram` to diff:
Port JGit's HistogramDiff algorithm over to C. Rough numbers (TODO) show that it is faster than its `--patience` cousin, as well as the default Meyers algorithm.
The implementation has been reworked to use structs and pointers, instead of bitmasks, thus doing away with JGit's […]
We also use xdiff's default hash table implementation (`XDL_HASHLONG()`) for convenience.
Commit 8555123 (git 1.7.10, April 2012) added:

8c912ee (teach `--histogram` to diff, 2011-07-12) claimed histogram diff was faster than both Myers and patience.
We have since incorporated a performance testing framework, so add a test that compares the various diff tasks performed in a real […] workload.
This does indeed show that histogram diff slightly beats Myers, while patience is much slower than the others.
Finally, commit 07ab4de (git 1.8.2, March 2013) added:

config: Introduce diff.algorithm variable
Some users or projects prefer different algorithms over others, e.g. patience over myers or similar.
However, specifying the appropriate argument every time diff is to be used is impractical. Moreover, creating an alias doesn't play nicely with other tools based on diff (…). Hence, a configuration variable which is able to set a specific algorithm is needed.
For now, these four values are accepted: `myers` (which has the same effect as not setting the config variable at all), `minimal`, `patience`, and `histogram`.
Commit 07924d4 concurrently added the `--diff-algorithm` command-line option.
As the OP Stuart P. Bentley mentions in the comments:
you can configure Git to use histogram by default with:
git config --global diff.algorithm histogram
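If you only want a different algorithm for a single invocation, the same choice is available as command-line flags, so no config change is needed:

```shell
# Pick the diff algorithm per invocation
git diff --histogram
git diff --diff-algorithm=patience

# Show the configured default (prints nothing, and exits non-zero, if unset)
git config diff.algorithm
```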
Update: Git 2.12 (Q1 2017) will retire the “fast hash” that had disastrous performance issues in some corner cases.
See commit 1f7c926 (01 Dec 2016) by Jeff King. (Merged by Junio C Hamano, gitster, in commit 731490b, 19 Dec 2016.)
The `xdiff` code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes and on how many hash collisions we see.
The idea of `XDL_FAST_HASH` is to speed up the hash computation. But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running `git log -p` on […] was ~8% faster with it), but in others it can slow things down. One pathological case saw over a 100x slowdown.
There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It’s slightly slower in the common case, but it has fewer surprising pathological cases.