Advantages of categorizing objects into folders named as the first 2 characters of SHA-1 string?

Git stores objects in folders named after the first 2 characters of each object’s SHA-1 string. What’s the advantage of this storage structure?

I don’t think it avoids any potential conflicts, so why not put all the objects into one flat folder?

2 solutions collected from the web:

    There are multiple reasons to store files using the following layout:

    00/f56c0de1c61fdb926e79e8a0a65bd12930c9
    25/ec1c55bfb660548a6770238668c4b117d92f
    5d/4b01d98f17a9ad9dd1526b49ba39b5aa37a1
    63/6f740b6c284ce6685dc17d473a7360ace249
    b1/066d178188dde110149a8422ab651b0ee615
    b1/a2b7d02b7b0c43530677ab06235382a37e20
    da/a3ee5e6b4b0d3255bfef95601890afd80709
    

    The main reason is that the number of files you can store in a folder may be limited: some (pretty old) file systems don’t allow more than 64k files inside a directory. That is quite a small number if you count everything that Git stores.

    Also, because the names come from a hashing algorithm, you can be almost sure that no folder will hold too many files: the files should be spread evenly across the subfolders (at least as the number of files grows).
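
The even spread described above can be checked empirically. The following is a minimal sketch, assuming synthetic object contents (the `object {i}` strings are made up for illustration and are not real Git objects); it counts how many hashes land under each two-character prefix.

```python
import hashlib
from collections import Counter


def bucket_counts(num_objects: int) -> Counter:
    """Count how many synthetic objects land under each two-character prefix."""
    counts = Counter()
    for i in range(num_objects):
        # Hypothetical content: any set of distinct byte strings would do.
        digest = hashlib.sha1(f"object {i}".encode()).hexdigest()
        counts[digest[:2]] += 1
    return counts


counts = bucket_counts(100_000)
print("directories used:", len(counts))
print("smallest:", min(counts.values()), "largest:", max(counts.values()))
```

With 100,000 objects the average is about 390 per directory, and the smallest and largest counts should stay fairly close to that figure.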

    I also think that having too many files in one folder could cause performance issues on some filesystems (although I’m not 100% sure of that).

    This loose-object structure (example here, and in Git Internals – Packfiles) is how objects are initially stored in Git.

    You can see that approach used elsewhere too (for an image database, for instance; there it is used on two levels, but the same idea applies to Git):

    compute the SHA-1 hash of the image, generate its hexadecimal form, and use the first two characters of the SHA-1 string as a first-level directory.

    SHA1 hashes give good distribution, even in the first few characters, so that will nicely distribute the files into a (relatively) balanced folder structure.
    This simplistic approach will use no more than 256 folders at each level.

    Using the hexadecimal form of the image’s SHA-1 has two very nice benefits:

    • no name collisions, and
    • any given file will only be stored once even if the same file is uploaded more than once.
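
The two-level scheme quoted above might look roughly like this. It is only a sketch: the `store` helper and the `ab/cd/rest` layout are assumptions made for illustration, not the code of any particular image database.

```python
import hashlib
import tempfile
from pathlib import Path


def store(root: Path, data: bytes) -> Path:
    """Store data under root/ab/cd/<rest of the SHA-1 hex digest>."""
    digest = hashlib.sha1(data).hexdigest()
    path = root / digest[:2] / digest[2:4] / digest[4:]
    if not path.exists():  # identical content always maps to the same path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store(root, b"same bytes")
    second = store(root, b"same bytes")  # uploaded again: no new file
    other = store(root, b"different bytes")
    same_path = first == second
    file_count = len([p for p in root.rglob("*") if p.is_file()])
    print(same_path, file_count)
```

Because the path is derived from the content, uploading the same bytes twice resolves to one file, which is the deduplication benefit listed above.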

    See gitrepository-layout:

    objects/[0-9a-f][0-9a-f]

    A newly created object is stored in its own file.
    The objects are splayed over 256 subdirectories using the first two characters of the sha1 object name to keep the number of directory entries in objects itself to a manageable number.
    Objects found here are often called unpacked (or loose) objects.
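
As an illustration of how an object name maps to that layout, here is a small sketch that computes where a loose blob would be placed. It assumes a repository using SHA-1 object names, and it only derives the path; it does not write the zlib-compressed object file that Git actually stores. The function name `loose_object_path` is made up for this example.

```python
import hashlib


def loose_object_path(content: bytes, obj_type: str = "blob") -> str:
    """Return the objects/xx/... path for the given content."""
    # Git hashes a header "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    name = hashlib.sha1(header + content).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    return f"objects/{name[:2]}/{name[2:]}"


print(loose_object_path(b""))
# prints objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
```

The result can be compared against `git hash-object`, which prints the same 40-character name for the same content.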

    git commit 88520ca gives us more information about the benefit of that structure, which influences when gc is run:

    search 4 directories to improve statistic of gc hint

    On Windows, git-gui suggests running the garbage collector if it finds
    1 or more files in .git/objects/42 (as opposed to 8 files on other
    platforms).
    The probability of that happening if the repo contains about 100 loose objects is 32%.
    The probability for the same to happen when searching 4 directories is only 8%, which is bit more reasonable.

    The following octave script shows the probability for at least m*q
    objects to be found in q subdirectories of .git/objects if n is the
    total number of objects.

    (It uses binocdf, the cumulative distribution function (CDF) of the binomial distribution.)

    q = 4;
    m = [1 2 8];
    n = 0:10:2000;
    
    P = zeros(length(n), length(m));
    for k = 1:length(n)
            P(k, :) = 1-binocdf(q*m-1, n(k), q/(256-q));
    end
    plot(n, P);
    
    n \ q   1       4
    50      18%     1%
    100     32%     8%
    200     54%     39%
    500     86%     96%
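
For readers without Octave, the same calculation can be sketched in Python using only the standard library. This is a port written for illustration, not part of the commit; it keeps the script's success probability `q/(256-q)` as quoted, and the helper names are made up here.

```python
from math import comb


def prob_at_least(threshold: int, n: int, p: float) -> float:
    """P(X >= threshold) for X ~ Binomial(n, p)."""
    # Sum P(X = k) for k below the threshold, then take the complement.
    below = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold))
    return 1.0 - below


def gc_hint_probability(n: int, q: int, m: int = 1) -> float:
    """Chance of finding at least m*q loose objects in q directories."""
    # Same per-object probability as the quoted Octave script.
    return prob_at_least(q * m, n, q / (256 - q))


for n in (50, 100, 200, 500):
    print(n, f"{gc_hint_probability(n, 1):.0%}", f"{gc_hint_probability(n, 4):.0%}")
```

The loop prints one row per value of n, in the same order as the table above, so the two can be compared side by side.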
    

    This organization also lets Git’s packing heuristics run as quickly and efficiently as possible when git gc has to run.
