Bits and pieces: Comparing Binary Data (and More)

If a standard Linux utility doesn’t do quite what you want, adapt it.

A main source of data on a Linux system is a file. Linux data also moves in streams. A pipe, for instance, routes an ordered stream of data from one process to another. That data might never touch a file.

At the lowest level, Linux data is a series of bits: 1s and 0s. When you’re debugging a problem or simply trying to understand how something works (or more likely, why something doesn’t work), it can help to know exactly what’s in that data, bit-by-bit. The od utility can show you. There are also techniques to compare binary data. Get your shovel ready; let’s dig in. (The October 2004 “Power Tools” column, “Performing Data Surgery,” has some related information.)

Introducing od

The od utility dumps the contents of a file or a data stream (from od’s standard input) to standard output, in a more-readable format. od stands for “octal dump,” but od can also interpret data as ASCII characters and backslashed escapes (its -c option), unsigned decimal (-d option), hexadecimal (-x), and more. By default, od dumps all of the data. The GNU option -N count dumps only the first count bytes of input.

As a first example, let’s create a file with 100 NUL (zero) bytes. Reading the Linux device /dev/zero returns as many NUL bytes as you want. The obscure dd utility is good at reading an exact number of bytes. We’ll read a block of 100 bytes from /dev/zero and write it to a file named 100zeroes. Then we’ll make a copy of that file named hundred0s and list both of them. (The shell history operator !* copies all arguments from the previous command line.)

$ dd if=/dev/zero of=100zeroes bs=100 count=1
1+0 records in
1+0 records out
100 bytes transferred in 0.000421 seconds
$ cp 100zeroes hundred0s
$ ls -lG !*
ls -lG 100zeroes hundred0s
-rw-r--r--  1 jpeek 100 ... 100zeroes
-rw-r--r--  1 jpeek 100 ... hundred0s

Trying to display either file with cat or less wouldn’t show much because neither file is a text file. It’s a job for od. Each od output line starts with the file offset, or the number of bytes from the beginning of the file. By default, the offset is shown in octal; the option -Ad uses decimal offsets instead. The option -w6 shows the values of 6 bytes per line. To show octal values, with each value showing one byte, we use -to1. (For full details on all of these options, see info od.)

$ od -w6 -Ad -to1 hundred0s
0000000 000 000 000 000 000 000
*
0000096 000 000 000 000
0000100

The first byte (the zeroth byte past the start of the file) is 000 octal. The next byte is 000 octal, and so on. The first line of od output shows bytes one through six. The next line, starting with * (a star), means “there are more lines like the previous line.” (The -v option shows all lines, without any stars.) The penultimate line shows byte 97 (an offset of 96 bytes from the first byte) through the last byte. The last offset, 0000100 (decimal here, because of -Ad), is always on an otherwise-blank line; it indicates that there is no more data.
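
The star abbreviation and the -v option are easy to try on a smaller input. This is a minimal sketch, assuming GNU od and a readable /dev/zero; the 12-byte size and the -w4 width are arbitrary choices for illustration.

```shell
# Twelve NUL bytes, four per line: three identical lines of data.
# od prints the first line, a star for the repeats, then the final offset.
head -c 12 /dev/zero | od -w4 -Ad -to1

# With -v every line is printed, so the star disappears.
head -c 12 /dev/zero | od -w4 -Ad -to1 -v
```

The first command should print three lines (data, star, final offset) and the second should print four (three data lines and the final offset).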

Another od example

If your data has text in it, use the od -c option. It shows characters instead of numeric values wherever possible. If a byte doesn’t have an ASCII character representation, you’ll see (by default) the octal value instead.
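
Here is a small sketch of how -c mixes printable characters, escapes, and octal fallbacks. It assumes GNU od; the helper name sample and its five bytes are invented for illustration and don't come from any file in this column.

```shell
# Five sample bytes: 'H', 'i', a TAB, the control byte 001, and a newline.
# (sample is a hypothetical helper; the bytes are made up.)
sample() { printf 'Hi\t\001\n'; }

# -c shows printable characters as themselves, TAB and newline as
# backslash escapes, and the control byte as its octal value.
sample | od -c

# -An drops the offset column; -N 2 stops after the first two bytes.
sample | od -An -c -N 2
```

The first dump should show H, i, \t, 001, and \n in that order; the second should show only H and i.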

Let’s compress the two NUL-filled files with gzip, then take a look inside the files.

$ gzip 100zeroes
$ gzip -9 hundred0s
$ ls -lG *gz
-rw-r--r--  1 jpeek 34 ... 100zeroes.gz
-rw-r--r--  1 jpeek 34 ... hundred0s.gz

gzip -9 (best compression) processed the second file, but it didn’t make a difference in this case. Each file has 34 bytes. So, what’s in the files? Listing One shows the output of od on each file.

LISTING ONE: Comparing two binary data files using od
$ od -c -w10 100zeroes.gz
0000000 037 213  \b  \b   f   /   û   O  \0 003
0000012   1   0   0   z   e   r   o   e   s  \0
0000024   c   `       =  \0  \0   Ê   Æ 210 231
0000036   d  \0  \0  \0
0000042
$ od -c -w10 hundred0s.gz
0000000 037 213  \b  \b   f   /   û   O  \0 003
0000012   h   u   n   d   r   e   d   0   s  \0
0000024   c   `       =  \0  \0   Ê   Æ 210 231
0000036   d  \0  \0  \0
0000042

(The file offsets default to octal because the command line omitted the -Ad option.) As you can see, the two files are almost identical except for the 11th through 20th bytes (the lines at offset 0000012 octal). That’s where gzip put the original filename. The -c option told od to show characters wherever possible, so those ASCII filenames appear byte-by-byte. The other characters are octal bytes like 037 and escape sequences for non-printable ASCII values, like \b for a backspace character. (If you’d like to know more, see the gzip file format spec.)
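
If you'd rather pick out individual header fields than read a whole dump, od's -j (skip) and -N (count) options can help. The sketch below assumes GNU od, a gzip that stores the original name (its usual default), and the fixed ten-byte header described in the gzip format spec (RFC 1952): two magic bytes, method, flags, a four-byte timestamp, extra flags, and OS. The scratch file name sample is made up.

```shell
# Work in a scratch directory; "sample" is a made-up file name.
dir=$(mktemp -d) || exit 1
head -c 100 /dev/zero > "$dir/sample"
gzip "$dir/sample"            # creates $dir/sample.gz

# Bytes 1-2: the gzip magic number (037 213 octal, 1f 8b hex).
od -An -tx1 -N 2 "$dir/sample.gz"

# Byte 9 (offset 8): the extra-flags byte, shown in decimal.
od -An -tu1 -j 8 -N 1 "$dir/sample.gz"

# The stored name starts at offset 10; -j 10 skips the fixed header.
# Seven bytes cover "sample" plus its terminating NUL.
od -An -c -j 10 -N 7 "$dir/sample.gz"
```

Remove the scratch directory (rm -rf "$dir") when you're done.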

Simpler file comparisons with cmp

If you only need to compare two files of the same length, cmp might do the job:

$ cmp 100zeroes.gz hundred0s.gz
100zeroes.gz hundred0s.gz differ: byte 5, line 1
$ cmp -l 100zeroes.gz hundred0s.gz
  5  63  35
  6 275 276
  9   0   2
 11  61 150
 12  60 165
 13  60 156
...

By default, cmp lists the first byte that’s different: the fifth byte (decimal) in line 1. (“Line” here means a string ending with a newline character, which od -c would show as \n. These files have only one line or, strictly speaking, no complete lines, because there’s no newline.)

The cmp option -l (lowercase “L”) shows all differences, byte-by-byte. The first column is the byte number; the second column is the octal byte value in the first file; and the third column lists the corresponding byte in the second file.

Comparing these to the results from od -c, shown earlier, helps make it clear. For instance, byte 11 in the first file is 61 octal, which is the character 1; this is the first character of the stored filename 100zeroes. The corresponding byte in the second file is 150 octal, which is the character h, the first character of hundred0s.

The man page ascii is a handy reference for this sort of work. The Linux version (on Debian, at least) shows the ASCII characters with their octal, decimal and hex values.
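
If you don't have the ascii man page handy, the shell's printf can do the octal-to-character lookup. This is a minimal sketch; octal_to_char is a hypothetical helper name, not part of cmp or od.

```shell
# octal_to_char is a hypothetical helper, not part of cmp or od.
# %b makes printf expand the \0NNN octal escape built from the argument.
octal_to_char() {
  printf '%b' "\\0$1"
}

octal_to_char 61;  echo    # byte 11 in 100zeroes.gz: prints 1
octal_to_char 150; echo    # byte 11 in hundred0s.gz: prints h
```

Feed it any value from the second or third column of cmp -l output. Values above 177 octal aren't ASCII, so what you see for them depends on your terminal's character set.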

When One Tool Won’t Do the Job

This column almost never covers just one topic in a straight line, and this month is no exception. Although you may never need to make byte-by-byte comparisons, let’s look at another way to do them. The techniques shown next could help you in other cases, too, and with other utilities.

Although most GNU versions of Unix utilities have a lot more features than the originals, there are still jobs they can’t handle. Twenty-five or thirty years ago, when Unix was fairly new, it was common to chain utilities, slicing and dicing the output of one to make input for another. Let’s do that here.

Let’s say you want to compare two files — possibly long ones — byte-by-byte. od can show files byte-by-byte, but comparing two od listings can be tedious, especially if the two files have different lengths. cmp uses a compact format, but if both files aren’t the same length, cmp is even less useful because it can’t track when a series of characters has been inserted or deleted.

$ ls -l file?
-rw-r--r-- 1 jpeek 427 Jan  2 08:39 file1
-rw-r--r-- 1 jpeek 432 Jan  2 08:41 file2
$ cmp file?
file1 file2 differ: char 149, line 1
$ diff file?
Files file1 and file2 differ
$ cmp -lb file?
149   0 ^@   163 s
151   0 ^@    63 3
... 40 more lines ...
427  12 ^J     0 ^@
cmp: EOF on file1

When you’re faced with a problem like this, ask what utilities can do a part of the job. The steps to follow might be:

  1. od can show files byte-by-byte. By default, though, its output lines show more than one byte. So, if two files aren’t the same length, the rows of bytes in their od listings can get “out of sync,” making them hard to compare.

    The GNU od has the -w1 option to show one byte per line. (If it didn’t, you could use a utility like sed to replace the space between each byte with a newline character.)

  2. diff can compare two files and display where content has been inserted or deleted. It seems that diff might be able to compare the od dumps of two different-length files, and it definitely can compare two equal-length files.

  3. If you’re comparing two od outputs, and the file contents are offset from each other (that is, if identical byte sequences are at different places in each file), diff will be confused by the offset values in the first field of each od output line, so it won’t be able to spot the identical sequences in the od output. You’ll have to work around that.
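
Step 1 mentions a fallback for an od that has no -w1 option. Here is a minimal sketch of that idea. It uses tr rather than the sed mentioned above, and it dumps octal values (-to1) instead of characters, because od -c shows a space byte as a blank field, which splitting on spaces would lose. The helper name one_byte_per_line is hypothetical.

```shell
# one_byte_per_line is a hypothetical helper name.
# -An drops offsets, -v disables the star abbreviation, and -to1 prints
# each byte as a three-digit octal value, so splitting on spaces is safe.
# tr turns each run of spaces into one newline; sed drops the empty
# line left by od's leading space.
one_byte_per_line() {
  od -An -v -to1 "$@" | tr -s ' ' '\n' | sed '/^$/d'
}

# Usage: reads the named file, or standard input if no file is given.
printf 'hi\n' | one_byte_per_line
```

The usage line should print 150, 151, and 012, one value per line.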

The bdiff Script

Let’s see a bash script named bdiff, for “binary diff,” shown in Listing Two. Most of the script is setup, checking the two input files and creating two temporary files with mktemp. Here, let’s just look at the script’s two most important parts. You can download a complete copy of the bdiff script.

LISTING TWO: The bdiff script to compare two binary files (the initial diff command is commented out)
#!/bin/bash

myname=${0##*/} # basename of this script

# 6 spaces, regexp for diff, TAB (written \t), and 3 spaces:
pattern=$'      [<|>]\t   '

# Set up temp files:
if [ $# -ne 2 -o ! -r "$1" -o ! -r "$2" ]
then
  echo 1>&2 "Usage: $myname file1 file2
  (Check: both files exist, readable?)"
  exit 1
fi
tmp1=$(mktemp -t $myname.1.XXXXXXXXX) || exit 1
if tmp2=$(mktemp -t $myname.2.XXXXXXXXX)
then
  stat=1 # default exit status; reset later
  trap 'rm -f $tmp1 $tmp2; exit $stat' 0 1 2 15
else
  rm -f $tmp1
  exit 1
fi

# Run od, show characters, one byte per line,
# show all bytes. Then remove offset values:
od -c -w1 -v "$1" | sed 's/^[0-9]*//' > $tmp1
od -c -w1 -v "$2" | sed 's/^[0-9]*//' > $tmp2

# Side-by-side diff, output width 30 columns:
#diff -y -W 30 --suppress-common-lines $tmp1 $tmp2
diff -y -W 30 $tmp1 $tmp2 | grep -C1 "$pattern"
stat=${PIPESTATUS[0]} # exit with same status as diff (not grep)

The two lines below run od on each file, then use sed to remove the offset value (the leading digits) from each line of od output. The two temporary files contain the character representations of each byte, one byte per line:

od -c -w1 -v "$1" | sed 's/^[0-9]*//' > $tmp1
od -c -w1 -v "$2" | sed 's/^[0-9]*//' > $tmp2
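
To see what ends up in each temporary file, you can run the same pipeline on a few bytes. This is a sketch assuming GNU od; dump_chars is a hypothetical helper name and the three sample bytes are made up.

```shell
# dump_chars is a hypothetical helper wrapping the od | sed pipeline.
dump_chars() {
  od -c -w1 -v "$@" | sed 's/^[0-9]*//'
}

# Three sample bytes: 'a', NUL, newline.
printf 'a\000\n' | dump_chars
```

The result should have one line per byte (a, \0, and \n, each right-justified in a four-column field) plus an empty last line where od's closing offset used to be. The sed expression strips only digits, so it relies on od's default octal offsets; hexadecimal offsets (-Ax) would leave letters behind.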

Next, the script runs diff -y (also known as sdiff) to get side-by-side diff format. The normal side-by-side format makes 130-column lines, but these narrow input lines will fit easily in just 30 columns. The GNU option --suppress-common-lines shows only the lines that differ:

# Side-by-side diff, output width 30 columns:
diff -y -W 30 --suppress-common-lines $tmp1 $tmp2

Let’s run the script. Here are the six differences:

$ bdiff file1 file2
          >    s
          >    3
   \0     |    1
          >    b
          >    6
          >    )

The results show that an s, a digit 3, a b, a 6, and a right parenthesis were inserted. (On each of those lines, the blank left column and the > in the middle column means text was inserted into the second file.) Also, a NUL byte (\0) was replaced with a digit 1.

Revising bdiff

Let’s revise bdiff to show context around the changes. That means removing the diff --suppress-common-lines option. Removing that option gives a lot of output on long files, though. We can’t use the diff -C (“context”) option because it conflicts with -y.

To solve that, let’s add grep -C, which shows context around a match. We’d like to show context around any line where the second column of diff output shows a change: a <, |, or > character.
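
If you haven't used grep -C before, this small sketch shows its behavior on made-up input: nine lines, two of which contain the word CHANGED.

```shell
# Nine made-up input lines; grep -C1 prints each match with one line
# of context on either side, and "--" between non-adjacent groups.
printf '%s\n' one two CHANGED four five six seven CHANGED nine |
  grep -C1 CHANGED
```

grep should print two groups of three lines (two/CHANGED/four and seven/CHANGED/nine) with a line containing only -- between them, because lines five and six fall outside both contexts.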

But what’s in that whitespace around the center column: is it spaces or TABs? Let’s use cat -v -t -e to check the bdiff output. It shows TAB characters as ^I, shows other non-printing characters in a printable way, and adds $ at the end of each line:

$ bdiff file? | cat -v -t -e
^I      >^I   s$
^I      >^I   3$
  \0^I      |^I   1$
^I      >^I   b$
^I      >^I   6$
^I      >^I   )$

The center-column character seems always to have six spaces before it, and a TAB and three spaces after it. (You could do more-careful checking, but this’ll do for now.) Let’s change the bdiff script in two places: store a grep pattern in a shell variable at the start of the script, and edit the diff line to remove the --suppress-common-lines option and add grep with one line of context:

# 6 spaces, regexp for diff, TAB (written \t), and 3 spaces:
pattern=$'      [<|>]\t   '
...
diff -y -W 30 $tmp1 $tmp2 | grep -C1 "$pattern"

Let’s run it:

$ bdiff file?
  \0              \0
              >    s
  \0              \0
              >    3
  \0              \0
  \0          |    1
  \0              \0
--
  \0              \0
              >    b
  \0              \0
              >    6
  \0              \0
--
  \0              \0
              >    )
  \0              \0

The output shows that both files have NUL bytes around the changed areas.

And That’s Not All...

You can use that method for other problems. For instance, to compare two directory listings, run ls on each directory, strip out the parts of ls output that you don’t want to compare, then run diff. It’s a handy technique to know.
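
As a rough sketch of that directory comparison: the two helper names below are hypothetical, and the awk filter assumes file names without whitespace and no symbolic links. It keeps only the size and name columns of ls -l, so differing dates, owners, and permissions don't show up as changes.

```shell
# list_names_sizes and compare_dirs are hypothetical helper names.
# Keep only the size and name columns of "ls -l"; skip the "total" line.
# Assumes file names without whitespace and no symbolic links.
list_names_sizes() {
  ls -l "$1" | awk 'NR > 1 { print $5, $NF }'
}

compare_dirs() {
  left=$(mktemp) || return 2
  right=$(mktemp) || { rm -f "$left"; return 2; }
  list_names_sizes "$1" > "$left"
  list_names_sizes "$2" > "$right"
  diff "$left" "$right"
  status=$?            # diff's status: 0 = same, 1 = different
  rm -f "$left" "$right"
  return $status
}
```

Run it as compare_dirs dir1 dir2; like diff, it returns 0 when the stripped listings match and 1 when they differ.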

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.
