Why is the Python calculated “hashlib.sha1” different from “git hash-object” for a file?

I’m trying to calculate the SHA-1 value of a file.

I’ve written this script:

    import hashlib

    def hashfile(filepath):
        sha1 = hashlib.sha1()
        with open(filepath, 'rb') as f:
            sha1.update(f.read())  # hash the raw file bytes
        return sha1.hexdigest()

    For a specific file I get this hash value:
    But when I calculate the value with git hash-object, I get this value: d339346ca154f6ed9e92205c3c5c38112e761eb7

    How come they differ? Am I doing something wrong, or can I just ignore the difference?

2 solutions collected from the web for “Why is the Python calculated “hashlib.sha1” different from “git hash-object” for a file?”

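    Git does not hash the file contents alone. It prepends a short header, so the hash of a file (a "blob") is calculated like this:

    sha1("blob " + filesize + "\0" + data)

    Here filesize is the length of the data in bytes, written as a decimal number, and "\0" is a single NUL byte. Your script is not wrong; it just hashes different input than Git does, so the two values will never match.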
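As a minimal sketch of the formula above (the function name `git_blob_sha1` is my own choice, not part of Git or hashlib), the header can be built as bytes and hashed together with the content:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Header: the ASCII text "blob ", the size in bytes as a decimal
    # number, and a single NUL byte. The raw content follows it.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

def git_blob_sha1_of_file(filepath: str) -> str:
    """Read a file in binary mode and hash it the way shown above."""
    with open(filepath, "rb") as f:
        return git_blob_sha1(f.read())
```

For empty content this returns e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the well-known ID of Git's empty blob, which is a quick sanity check. Note that this sketch reads the whole file into memory, so it suits small files only.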


    For reference, here’s a more concise version:

    def sha1OfFile(filepath):
        import hashlib
        with open(filepath, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

    On second thought: although I’ve never seen it happen, I think f.read() could in principle return less than the full file, and for a many-gigabyte file it could run out of memory. For everyone’s edification, let’s consider how to fix that. A first attempt is:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            for line in f:
                sha.update(line)  # feed each line to the hash
            return sha.hexdigest()

    However, there’s no guarantee that '\n' appears in the file at all, so the fact that the for loop will give us blocks of the file that end in '\n' could give us the same problem we had originally. Sadly, I don’t see any similarly Pythonic way to iterate over blocks of the file as large as possible, which, I think, means we are stuck with a while True: ... break loop and with a magic number for the block size:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            while True:
                block = f.read(2**20) # Magic number: one-megabyte blocks.
                if not block: break
                sha.update(block)
            return sha.hexdigest()

    Of course, who’s to say we can store one-megabyte strings? We probably can, but what if we are on a tiny embedded computer?

    I wish I could think of a cleaner way that is guaranteed to not run out of memory on enormous files and that doesn’t have magic numbers and that performs as well as the original simple Pythonic solution.
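One possible way to tidy the loop, offered as a sketch and not a definitive answer: the two-argument form of the built-in `iter()` calls a function repeatedly until it returns a sentinel value, which removes the `while True: ... break`. The block size does not disappear, but it becomes a parameter the caller can shrink on a small machine. The names `sha1_of_file` and `block_size`, and the default of one mebibyte, are assumptions of mine.

```python
import hashlib
from functools import partial

def sha1_of_file(filepath, block_size=2**20):
    """Hash a file in fixed-size blocks so memory use stays bounded."""
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(block_size)
        # until it returns b'' (end of file).
        for block in iter(partial(f.read, block_size), b''):
            sha.update(block)
    return sha.hexdigest()
```

This still hashes only the raw file bytes, so the result will differ from `git hash-object` unless the "blob" header is fed to the hash first. On Python 3.11 and later, `hashlib.file_digest(f, 'sha1')` performs the chunked reading for you.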
