Git Under The Hood

Akash Yadav
Published in CodeX
4 min read · Oct 5, 2021


How does git store the information?

At its core, git is a key-value store. The key is the hash of the content being stored, and the value is the content itself.

Git generates the key with the SHA1 hash function, which produces a 160-bit digest written as a 40-digit hexadecimal number. The same input always produces the same key. This type of system is also known as a content-addressable system.
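To make the key-value idea concrete, here is a minimal sketch of a content-addressable store in Python. The ContentStore class and its method names are made up for illustration. It hashes the raw bytes only, whereas git also adds a small header before hashing, so these keys will not match git's.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (not git's real implementation)."""

    def __init__(self):
        self._objects = {}  # key: 40-character hex SHA-1, value: raw bytes

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()  # the key is derived from the content
        self._objects[key] = content  # identical content lands in the same slot
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"hello git\n")
second = store.put(b"hello git\n")  # same bytes, so the same key
print(len(first), first == second, len(store))  # prints: 40 True 1
```

Storing the same bytes twice returns the same key and leaves a single entry in the store.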

Git stores information in the form of git objects. The three types covered here are

  • blob
  • tree
  • commit

Let’s demystify them one by one!

Blob :

A blob stores the content of a file, compressed, together with a short metadata header.

The information stored in a blob object is:

  • type, i.e. blob, followed by a space
  • size of the content in bytes
  • \0 null character
  • content

The plumbing (internal) command git uses to generate blobs is git hash-object. If we ask git for the SHA1 of the string “demystifying internals of git”, we get the hash “a1f544d55619fab2909286d9c6b5c5fa7b8825db”.

Figure: SHA1 hash generated using the git hash-object command

Now let’s generate the same SHA1 for the content, in the form git stores it, with another tool. I am using the OpenSSL command-line tool for demonstration purposes. Any other tool or library, for example from Python or Node, will yield the same result.

Figure: hash generated using the third-party tool OpenSSL

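The hash generated by OpenSSL for the string ‘blob 30\0demystifying internals of git’ is the same as the one generated by git hash-object. The size in the header is 30, one more than the 29 visible characters, which suggests the input ended with a newline (echo appends one by default).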
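The same computation can be sketched in Python with the standard hashlib module. The function name git_blob_id is made up for illustration. The trailing newline in the input is an assumption, based on the size of 30 in the header; if the input to git hash-object was different, the printed hash will differ from the one shown earlier.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 that git assigns to `content` when stored as a blob."""
    # Header layout: type, space, size in bytes, null character.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Assumption: the string was piped through echo, which appends a newline,
# so the 29 visible characters become 30 bytes.
data = b"demystifying internals of git\n"
print(len(data))          # prints: 30
print(git_blob_id(data))  # compare with the output of git hash-object
```

The point is that there is nothing git-specific in the hash itself: it is a plain SHA1 over the header plus the content.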

So where does git store these git objects?

Git stores these objects in the .git directory, which is created when you run the git init command in a directory. In the GitUnderTheHood repository used for this article, the generated blob object is stored in the .git/objects directory: the first 2 characters of the hash form the directory name and the remaining 38 characters form the file name.
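As a rough sketch of that layout, the Python snippet below writes a blob the way git lays out a loose object: header plus content, zlib-compressed, under a two-character directory. The function name write_loose_blob is hypothetical, and a temporary directory stands in for .git/objects so that no real repository is touched.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_blob(objects_dir: str, content: bytes) -> str:
    """Store `content` in git's loose-object layout and return the file path."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    # The first 2 hex characters name the directory, the remaining 38 the file.
    directory = os.path.join(objects_dir, object_id[:2])
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, object_id[2:])
    with open(path, "wb") as handle:
        handle.write(zlib.compress(store))  # objects are zlib-compressed on disk
    return path


# A temporary directory stands in for .git/objects (an assumption for the demo).
with tempfile.TemporaryDirectory() as objects_dir:
    path = write_loose_blob(objects_dir, b"demystifying internals of git\n")
    print(os.path.relpath(path, objects_dir))
```

Real repositories also pack objects into packfiles over time, which this sketch does not cover.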

Tree :

A blob is the basic unit of storage in git, but it is missing some information, i.e.

  • filenames
  • directory structure

So if we save a file as a blob, how do we know its name and the directory structure it was stored in? Git stores this information in a special git object known as a tree.

Trees and blobs together form a directed acyclic graph, a.k.a. DAG. A tree contains pointers, in the form of SHA1 hashes,

  • to blobs
  • to another tree

along with some metadata

  • type of pointer (blob or tree)
  • filename
  • file mode (the file mode tells what kind of entry it is, such as a directory, a regular file, an executable file, or a symlink)
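The sketch below shows how a tree object could be serialized and hashed in Python. Each entry is written as the mode, a space, the name, a null character, and then the 20 raw bytes of the SHA1. The function names and the file contents are made up for illustration. The modes used are git's usual ones: 100644 for a regular file, 100755 for an executable, 120000 for a symlink, and 40000 for a directory.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Hash a git object of the given type: header, null character, body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) tuples into the body of a tree object."""
    def sort_key(entry):
        mode, name, _ = entry
        # git orders directories as if their name ended with "/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return body


blob = object_id(b"blob", b"same content\n")  # hypothetical file content
copy_tree = object_id(b"tree", tree_body([("100644", "a-copy.txt", blob)]))
root_tree = object_id(b"tree", tree_body([
    ("100644", "a.txt", blob),     # regular file, points to a blob
    ("40000", "copy", copy_tree),  # sub-directory, points to another tree
]))
print(root_tree)
```

Both a.txt and copy/a-copy.txt point to the same blob hash, so the content itself exists only once.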

In my sample git repository, the content of a.txt is the same as that of a-copy.txt in the /copy directory, therefore their hashes are the same too.

The DAG will look something like this:

Figure: DAG for the git repo

Because the SHA1 key is determined entirely by the content, identical content is stored only once in git. This is how git saves storage space. Branches benefit from the same pointer-based design: a branch is just a reference to a commit, so creating or switching a branch mostly means changing a pointer from one git object to another, which is part of why checking out branches in git is so fast.

Commit :

A commit is a snapshot of what the project looked like at a point in time. It’s the third type of git object; it points to a tree and also contains some metadata such as

  • author and committer
  • date
  • message
  • parent commits

If you change any data in a commit, the commit gets a new SHA1. Even if the commit points to the same tree and file contents, a change in the date and time means a new hash is created.
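This can be sketched in Python by building the text of a commit object and hashing it. The commit_id function, the author identity, and the timestamps are all made up for illustration, and the author and committer are assumed to be the same person with a +0000 timezone.

```python
import hashlib


def commit_id(tree, parents, author, timestamp, message):
    """Hash a commit object built from its tree, parents, and metadata."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    # Simplification: author and committer are identical here.
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of git's empty tree
who = "Example Author <author@example.com>"  # hypothetical identity

first = commit_id(EMPTY_TREE, [], who, 1633392000, "initial commit")
later = commit_id(EMPTY_TREE, [], who, 1633392060, "initial commit")
print(first != later)  # prints: True (same tree and message, later timestamp)
```

Because the parent hashes are part of the hashed text, changing one commit also changes the hash of every commit that descends from it.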

If you try to look at one of the git object files stored in the .git/objects directory using cat, you won’t find anything useful, as these are zlib-compressed binary objects. To see the type and content of an object we can use another plumbing command of git, i.e. git cat-file, with the -t or -p flag respectively

  • git cat-file -t <hash> will print the type of git object
  • git cat-file -p <hash> will print the content of the git object
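For a feel of what git cat-file does behind the scenes, here is a minimal Python sketch that decompresses a loose object and splits the header from the content. The function name read_loose_object is hypothetical. The sketch only handles loose objects, not packfiles, and it returns raw bytes, whereas git cat-file -p pretty-prints trees. The demo writes its own object into a temporary directory instead of reading a real .git/objects.

```python
import hashlib
import os
import tempfile
import zlib


def read_loose_object(objects_dir: str, object_id: str):
    """Return (type, content) for a loose object, roughly cat-file -t and -p."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        raw = zlib.decompress(handle.read())
    header, _, content = raw.partition(b"\0")  # the header ends at the first null
    kind, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), content


# Demo: write one blob into a temporary stand-in for .git/objects, then read it.
with tempfile.TemporaryDirectory() as objects_dir:
    store = b"blob 6\0hello\n"
    object_id = hashlib.sha1(store).hexdigest()
    os.makedirs(os.path.join(objects_dir, object_id[:2]))
    with open(os.path.join(objects_dir, object_id[:2], object_id[2:]), "wb") as f:
        f.write(zlib.compress(store))
    print(read_loose_object(objects_dir, object_id))  # prints: ('blob', b'hello\n')
```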
