Fun with Diff
and using Python to render Bitmaps

I was playing around with the Longest Common Subsequence algorithm (which is used, among other things, for the Diff utility) for another project that I hope to write about soon, and I noticed some interesting patterns. The LCS problem has an intermediate step in which a matrix is built, with every column corresponding to a unit in the first input set, withevery row corresponds to a unit in the second input set. The value stored in each cell gives the length of the longest subsequence if the two sequences both ended at that point in the sequence. Once this entire matrix has been appropriately populated, the program can backtrack through it picking the appropriate elements to find the longest common subsequence.

When I was working on getting LCS to work, I kept noticing that there would occasionally be interesting shapes that would show up in the matrix based on where common subsequences showed up. By slightly modifying the code to just generate a bitmap showing the identical lines between two input texts, I was able to get a better look at what was actually happening in the LCS algorithm. Here’s the quick python script I wrote to generate the bitmap and convert it into an image using PIL.

 1 import Image
 2 
 3 def buildBitMap(txt1, txt2):
 4     lcstable = []
 5     for x in range(len(txt1)):   
 6         lcstable = lcstable + [[0]*(len(txt2))]
 7     for x, p in enumerate(txt1):
 8         for y, q in enumerate(txt2):
 9             if p == q:
10                 lcstable[x][y] = 1
11     return lcstable
12     
13 def drawBitMap(a, txt1, output_file):
14     im = Image.new("RGB", (len(a),len(a[0])))
15     for x, row in enumerate(a):
16         for y, bit in enumerate(row):
17             if a[x][y] != 0:
18                 if txt1[x].strip() == '':
19                     im.putpixel((x,y),(255,0,0))
20                 else:
21                     im.putpixel((x,y),(0,255,0))
22     im.save("output_file", "PNG")
23 
24 text1 = open("test1.txt").readlines()
25 text2 = open("test2.txt").readlines()
26 table = buildBitMap(text1,text2)
27 drawBitMap(table, text1, 'output.png')

Here is the output from running that code on two versions of the source from this website. All colored pixels denote a line from the first file that was identical to the corresponding line in the second file. Red colored pixels mean that the line was blank in both files, green colored lines mean that there was actual content in both files that was identical.


Graphical Diff Output

I will admit that this output is not terribly useful, but it certainly looks pretty, and there is some information that can be gleaned from this display. Firstly, the most prominent feature is the diagonal green line from top left to bottom right. This is what the LCS algorithm is going to return as the Longest Subsequence. Gaps in that line show where content was added or deleted. If the gap is horizontal, it means that content was added between the two versions, and if it’s horizontal it means that the content is removed. The other interesting thing it shows is where the same content is repeated several times. There are a number of very short diagonal lines towards the top, and these happen to note a chunk of code that I repeat in several functions to determine which template to render with. It might be possible to diff a file against itself using this technique to find chunks of code that should be abstracted into functions.

I vaguely recall someone making a display very similar to this for diffing binary files. I think it was shown at a Defcon a few years back, though I can no longer find any reference to it. If anyone remembers seeing this, leave me a note in the comments.

Comments: