
How does Git's snapshot-based storage efficiently store project history without saving a full copy of every file in every commit?



Git stores project history by treating each commit as a complete snapshot of the project's files at a specific moment, rather than as a set of differences (deltas) from the previous version. This snapshot model stays efficient because of Git's content-addressable storage, which identifies and stores every piece of data by a hash of its content.

Every piece of information Git stores, including file contents, directory structures, and commit metadata, is immutable and is identified by a SHA-1 hash (or a SHA-256 hash in repositories that use the newer SHA-256 object format). The hash is computed over a short header, giving the object's type and size, followed by the content itself. Because the hash depends only on that data, two files or directories with identical content produce exactly the same hash, and Git stores only one copy of that object.
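As a minimal sketch of this hashing scheme, the Python snippet below reproduces how a SHA-1 object ID is derived for a blob. The function name `git_object_id` and the sample file contents are illustrative assumptions, not part of Git itself.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Compute the SHA-1 ID Git would give `payload` stored as `obj_type`."""
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# Two files with different names but identical bytes (hypothetical contents).
notes_txt = b"test content\n"
copy_of_notes_txt = b"test content\n"

same_id = git_object_id("blob", notes_txt) == git_object_id("blob", copy_of_notes_txt)
print(same_id)  # True: identical content maps to one object ID
```

The file name plays no part in the blob's ID, which is why a renamed or duplicated file costs no extra blob storage. The result for a given input should match what `git hash-object` reports for a file with the same bytes.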

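Git's snapshot model rests on three main types of objects (a fourth type, the annotated tag object, plays no part in how snapshots are stored):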
1. Blob objects: These store the exact content of a file. When a file is added or modified, Git calculates its hash. If an object with that hash already exists in the repository's storage, Git simply reuses it, avoiding duplication. If the content is new, a new blob object is created.
2. Tree objects: These represent a directory's state at a point in time. A tree object contains pointers (references by hash) to blob objects (for files within that directory) and other tree objects (for subdirectories). A tree object's hash is calculated based on its contents, which include the names, modes, and hashes of all the blobs and trees it references. If a directory's contents and structure (including its files and subdirectories) remain unchanged between commits, its tree object will have the same hash, and Git will reuse the existing tree object.
3. Commit objects: Each commit object represents a single commit. It contains metadata such as the author, committer, and commit message, and, crucially, a pointer (reference by hash) to the top-level tree object that defines the complete state of the project's files for that commit. It also contains pointers to its parent commit(s), if any, which link the commits together into the project's history.
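The way these three object types reference one another can be sketched in Python as follows. This is a simplified illustration: the helper names, file contents, and the author line are hypothetical, and only regular files (mode `100644`) and directories (mode `40000`) are handled.

```python
import hashlib


def object_id(obj_type, payload):
    """SHA-1 over '<type> <size>\\0' + payload, as Git does for every object."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries):
    """Serialize (mode, name, hex_id) entries as '<mode> <name>\\0<20 raw bytes>'."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if they ended in '/'.
        return (name + "/" if mode == "40000" else name).encode()

    out = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        out += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
    return out


def commit_payload(tree_id, parent_ids, author, message):
    """Build a commit body: one tree pointer, parent pointers, then metadata."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {author}", "", message]
    return "\n".join(lines).encode()


# Hypothetical two-file project: README plus src/main.py.
readme_id = object_id("blob", b"hello\n")
main_id = object_id("blob", b"print('hi')\n")
src_tree_id = object_id("tree", tree_payload([("100644", "main.py", main_id)]))
root_tree_id = object_id("tree", tree_payload([
    ("100644", "README", readme_id),
    ("40000", "src", src_tree_id),
]))
who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity and timestamp
commit_id = object_id("commit", commit_payload(root_tree_id, [], who, "Initial commit\n"))
print(commit_id)
```

Because each tree embeds the hashes of its children and each commit embeds the hash of its root tree, a single commit ID pins down the entire snapshot.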

When a new commit is created, Git only needs to store new blob objects for files whose content has actually changed. For any file whose content has not changed, the new commit's tree objects simply point to the existing blob object from previous commits. Similarly, if entire directories or subdirectories are unchanged, their existing tree objects are reused. A changed file does require a new tree object for its own directory and for each parent directory up to the root, because those trees now contain a different hash, but tree objects are small. In this way Git avoids saving a full copy of every file in every commit: it references existing, unchanged content and structure objects and creates new objects only for what has genuinely changed.
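The reuse described above can be made concrete with a toy in-memory object store. The sketch below is an illustration under simplifying assumptions (the `ObjectStore` class, the nested-dict snapshot format, and the sample files are all hypothetical, and commit objects are left out); it counts how many objects a second snapshot adds when one file changes.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: object ID -> (type, payload)."""

    def __init__(self):
        self.objects = {}

    def put(self, obj_type, payload):
        header = f"{obj_type} {len(payload)}".encode() + b"\x00"
        oid = hashlib.sha1(header + payload).hexdigest()
        # An object that already exists is reused, not stored again.
        self.objects.setdefault(oid, (obj_type, payload))
        return oid

    def write_tree(self, directory):
        """Store a nested dict (name -> bytes for files, dict for subdirs)."""
        entries = []
        for name, node in directory.items():
            if isinstance(node, dict):
                entries.append(("40000", name, self.write_tree(node)))
            else:
                entries.append(("100644", name, self.put("blob", node)))
        entries.sort(key=lambda e: (e[1] + "/" if e[0] == "40000" else e[1]).encode())
        payload = b"".join(
            f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
            for mode, name, oid in entries
        )
        return self.put("tree", payload)


store = ObjectStore()
snapshot_1 = {
    "README": b"docs\n",
    "src": {"app.py": b"print('v1')\n", "util.py": b"def helper(): pass\n"},
    "assets": {"logo.txt": b"(logo)\n"},
}
store.write_tree(snapshot_1)
after_first = len(store.objects)

# Second snapshot: only src/app.py changes.
snapshot_2 = {**snapshot_1, "src": {**snapshot_1["src"], "app.py": b"print('v2')\n"}}
store.write_tree(snapshot_2)
added_by_second = len(store.objects) - after_first

print(after_first, added_by_second)  # 7 3
```

The first snapshot stores four blobs and three trees. The second adds only three objects: the new `app.py` blob, a new `src` tree, and a new root tree. The `README`, `util.py`, and `logo.txt` blobs and the `assets` tree are reused as they are.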

To further optimize storage and network transfer, Git consolidates individual objects into packfiles. A packfile is a single compressed file that stores many Git objects. Within a packfile, Git often applies delta encoding: instead of storing two similar objects (for example, two slightly different versions of the same file) in full, it stores one of them as a set of differences against the other. This significantly reduces the disk space required, especially for projects with long histories and many small, incremental changes. Delta encoding is purely a storage-level optimization layered on top of the snapshot-based, content-addressable model: logically, every commit still refers to a complete snapshot, and each object keeps the same hash whether it is stored whole or as a delta.
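The idea behind delta encoding can be sketched with copy and insert instructions, which is the general shape of the deltas Git stores in packfiles. The snippet below is not Git's binary delta format or its matching algorithm: it uses Python's `difflib` to find common runs, and the function names and sample data are assumptions made for illustration.

```python
import difflib


def make_delta(base: bytes, target: bytes):
    """Describe `target` as copies from `base` plus literal inserted bytes."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length within base
        elif j2 > j1:                               # 'replace' or 'insert'
            ops.append(("insert", target[j1:j2]))   # literal new bytes
        # 'delete' needs no instruction: those base bytes are simply not copied
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target object from the base object and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


# Two versions of a hypothetical 50-line file that differ in one line.
version_1 = b"".join(b"line %d\n" % i for i in range(50))
version_2 = version_1.replace(b"line 25\n", b"line twenty-five\n")

delta = make_delta(version_1, version_2)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(version_2), literal_bytes)
```

Only the literal bytes and a handful of small copy instructions need to be stored for the second version, and `apply_delta(version_1, delta)` reproduces it byte for byte.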