Have you ever used a tool like git diff
or the diff
command in your terminal to see what changes have been made to a file? It's one of the most fundamental utilities for developers, allowing us to quickly compare two versions of a file and understand the differences. But have you ever wondered how it actually works? The secret lies in a classic computer science problem you might have encountered on platforms like LeetCode: the Longest Common Subsequence (LCS).
Dynamic Programming: Longest Common Subsequence (LCS)
The longest common subsequence is classic dynamic programming problem.
What is the Longest Common Subsequence (LCS)?
The Longest Common Subsequence problem is a perfect example of a dynamic programming. Given two sequences (which can be strings, arrays, or in our case, lines of text in a file), the goal is to find the longest subsequence that is common to both. A subsequence doesn't require the characters to be contiguous, but they must maintain their original order.
For example, consider the sequences "ABCD" and "ACBAD".
The common subsequences are "A", "B", "C", "D", "AB", "AC", "AD", "BD", "ACD".
The longest of these is "ACD", which has a length of 3.
The Real World Application: the diff utility
The diff
command, and by extension, the core of how version control systems like Git work, is a direct application of the LCS algorithm. Instead of comparing characters, diff
treats each line of a file as a single element in a sequence. It finds the longest common subsequence of lines between the two versions of a file.
The lines that are part of this longest common subsequence are considered "unchanged" and form the anchor points for the comparison. The lines that are not in the LCS are the ones that have been added, deleted, or modified.
A simple diff utility in C++:
I have implemented a simple diff utility in C++. It takes two file path and generate a diff using LCS.
Do take a look and maybe try implementing in your favorite language.
Diff in Practice: Beyond Simple LCS
While the Longest Common Subsequence (LCS) algorithm provides a solid foundation for understanding how diff
works, real-world diff utilities employ more sophisticated and specialized algorithms. The basic LCS approach can be computationally expensive for very large files and sometimes produces diffs that are not as intuitive for a human to read.
In practice, there is no single "diff algorithm." Instead, a family of algorithms and heuristics have been developed over the decades, each with its own advantages:
Hunt-McIlroy Algorithm: This is the historical ancestor of many modern diff tools. The original Unix
diff
program from the 1970s was based on this algorithm. It is a highly optimized derivative of the basic LCS idea, designed to efficiently find a common subsequence of lines. Many of today's diff implementations are still based on this classic approach.Patience Diff: Developed by Bram Cohen (the creator of BitTorrent), Patience Diff is a fascinating alternative to traditional LCS-based algorithms [Bram’s post] Its primary goal is not just to find the shortest possible diff, but to produce a diff that is more human-readable. It works by identifying unique lines that appear in both files and using these as "anchors" to find larger blocks of unchanged code. This method is particularly effective for diffs with a lot of scattered, minor changes and is a great option in Git.
Meyers Diff Algorithm: This is another highly optimized algorithm that is widely used, including in popular libraries like Google's Diff-Match-Patch and the underlying C library that Git uses for its default diffs. The Meyers algorithm is known for its speed and ability to find the shortest possible diff, often outperforming basic LCS implementations. It's a prime example of an algorithm that improves on the foundational concepts to meet the performance demands of modern software development: More details
Operational Transformations (OT): While not a traditional
diff
algorithm, OT is a completely different paradigm for managing changes, most famously used in real-time collaborative editors like Google Docs. Instead of just comparing two documents to find the final differences, OT "reverse engineers" the discrete editing operations (insert, delete, change) that led to the changes. This allows multiple users to edit a document simultaneously and have their changes merged correctly without conflicts: More details
This evolution of diff algorithms demonstrates that while a foundational concept like LCS is essential, real-world applications often require specialized solutions to address issues like performance, readability, and concurrency. When you use git diff
, you're interacting with a tool that sits on the shoulders of these decades of algorithmic innovation.