If a standard Linux utility doesn’t do quite what you want, adapt it.
A main source of data on a Linux system is a file. Linux data also moves in streams. A pipe, for instance, routes an ordered stream of data from one process to another. That data might never touch a file.
At the lowest level, Linux data is a series of bits: 1s and 0s. When you’re debugging a problem or simply trying to understand how something works (or more likely, why something doesn’t work), it can help to know exactly what’s in that data, bit-by-bit. The od utility can show you. There are also techniques to compare binary data. Get your shovel ready; let’s dig in. (The October 2004 “Power Tools” column “Performing Data Surgery” has some related information.)
The od utility dumps the contents of a file or a data stream (from od’s standard input) to standard output, in a more-readable format. od stands for “octal dump,” but od can also interpret data as ASCII characters and backslashed escapes (its -c option), unsigned decimal (-d option), hexadecimal (-x), and more. By default, od dumps all of the data. The GNU option -N count dumps only the first count bytes of input.
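To get a feel for these formats, here is a minimal sketch that feeds one short string to od through a pipe. The sample string and the sample function are invented for illustration. Besides the options named above, the sketch assumes GNU od's -An (leave out the offset column) and -t x1 (hexadecimal, one byte per value).

```shell
# Five sample bytes: 'H', 'i', '!', newline, NUL (illustrative data only).
sample() { printf 'Hi!\n\0'; }

sample | od -c        # characters and backslash escapes
sample | od -t x1     # hexadecimal, one byte per value

# -N 2 stops after the first two bytes; -An leaves out the offset column.
first_two=$(sample | od -An -t x1 -N 2)
echo "first two bytes in hex:$first_two"
```

Because the data arrives on od's standard input, nothing here touches a file.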
As a first example, let’s create a file with 100 NUL (zero) bytes. Reading the Linux device /dev/zero returns as many NUL bytes as you want. The obscure dd utility is good at reading an exact number of bytes. We’ll read a block of 100 bytes from /dev/zero and write it to a file named 100zeroes. Then we’ll make a copy of that file named hundred0s and list both of them. (The shell history operator !* copies all arguments from the previous command line.)
$ dd if=/dev/zero of=100zeroes bs=100 count=1
1+0 records in
1+0 records out
100 bytes transferred in 0.000421 seconds
$ cp 100zeroes hundred0s
$ ls -lG !*
ls -lG 100zeroes hundred0s
-rw-r--r-- 1 jpeek 100 ... 100zeroes
-rw-r--r-- 1 jpeek 100 ... hundred0s
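As a hedged aside, head -c (as found in GNU head) can produce the same file, and tr offers a quick check that a file holds nothing but NULs. The scratch directory and the variable names below are made up for this sketch.

```shell
workdir=$(mktemp -d)                      # scratch directory (hypothetical name)

# Same idea as the dd command: take exactly 100 bytes from /dev/zero.
head -c 100 /dev/zero > "$workdir/100zeroes"
cp "$workdir/100zeroes" "$workdir/hundred0s"

size=$(wc -c < "$workdir/hundred0s")      # total bytes in the copy
# Delete every NUL byte; anything left over is a non-NUL byte.
non_nul=$(tr -d '\0' < "$workdir/hundred0s" | wc -c)
echo "size=$size non_nul=$non_nul"

rm -rf "$workdir"
```

A non_nul count of zero confirms that every byte in the file is a NUL.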
Trying to display either file with cat or less wouldn’t show much because neither file is a text file. It’s a job for od. Each od output line starts with the file offset, or the number of bytes from the beginning of the file. By default, the offset is shown in octal; the option -Ad uses decimal offsets instead. The option -w6 shows the values of 6 bytes per line. To show octal values, with each value showing one byte, we use -to1. (For full details on all of these options, see info od.)
$ od -w6 -Ad -to1 hundred0s
0000000 000 000 000 000 000 000
*
0000096 000 000 000 000
0000100
The first byte (the zeroth byte past the start of the file) is 000 octal. The next byte is 000 octal, and so on. The first line of od output shows bytes one through six. The next line, starting with * (a star), means “there are more lines like the previous line.” (The -v option shows all lines, without any stars.) The penultimate line shows byte 97 (an offset of 96 bytes from the first byte) through the last byte. The last offset, 0000100, is always on an otherwise-blank line; it indicates that there is no more data.
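To see what the star stands for, this small sketch runs the same dump with and without -v and counts the output lines. It reads the 100 NUL bytes straight from /dev/zero, so no file is needed. The zeros function and the variable names are illustrative.

```shell
# 100 NUL bytes straight from /dev/zero; no file needed.
zeros() { head -c 100 /dev/zero; }

zeros | od -w6 -Ad -to1        # repeated lines collapse into a single "*"
zeros | od -w6 -Ad -to1 -v     # -v prints every line

collapsed=$(zeros | od -w6 -Ad -to1 | wc -l)
verbose=$(zeros | od -w6 -Ad -to1 -v | wc -l)
echo "collapsed=$collapsed verbose=$verbose"
```

With -v, the 100 bytes fill 16 full lines of six bytes plus one line of four, followed by the final offset line.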
If your data has text in it, use the od -c option. It shows characters instead of numeric values wherever possible. If a byte doesn’t have an ASCII character representation, you’ll see (by default) the octal value instead.
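Here is a short sketch of od -c on mixed data. The four sample bytes are invented: a printable letter, two bytes that have backslash escapes, and one byte (octal 377) that has neither.

```shell
# Sample bytes: a letter, a newline, a NUL and byte 0377 (illustrative data).
printf 'A\n\0\377' | od -c

# Same dump without offsets, with runs of spaces squeezed for easier reading.
chars=$(printf 'A\n\0\377' | od -An -c | tr -s ' ')
printf '%s\n' "$chars"
```

The letter appears as itself, the newline and NUL appear as escapes, and the last byte falls back to its octal value.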
Let’s compress the two NUL-filled files with gzip, then take a look inside the files.
$ gzip 100zeroes
$ gzip -9 hundred0s
$ ls -lG *gz
-rw-r--r-- 1 jpeek 34 ... 100zeroes.gz
-rw-r--r-- 1 jpeek 34 ... hundred0s.gz
gzip -9 (best compression) processed the second file, but it didn’t make a difference in this case. Each file has 34 bytes.
So, what’s in the files? Listing One shows the output of od on each file.
$ od -c -w10 100zeroes.gz
0000000 037 213  \b  \b   f   /   û   O  \0 003
0000012   1   0   0   z   e   r   o   e   s  \0
0000024   c   `   =  \0  \0   Ê   Æ 210 231
0000036   d  \0  \0  \0
0000042
$ od -c -w10 hundred0s.gz
0000000 037 213  \b  \b   f   /   û   O  \0 003
0000012   h   u   n   d   r   e   d   0   s  \0
0000024   c   `   =  \0  \0   Ê   Æ 210 231
0000036   d  \0  \0  \0
0000042
(The file offsets default to octal since the command line omitted the -Ad option.) As you can see, the two files are almost identical except for the 11th through 20th bytes (the lines at offset 0000012 octal). That’s where gzip put the original filename. The -c option told od to show characters wherever possible, so those ASCII filenames appear byte-by-byte. The other characters are octal bytes like 037 and escape sequences for non-printable ASCII values, like \b for a backspace character. (If you’d like to know more, see the gzip file format spec.)
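You can pull those header fields out directly. The sketch below assumes GNU gzip (which stores the original name by default when it compresses a named file) and GNU od's -j option, which skips a number of bytes before dumping. The scratch directory and variable names are hypothetical.

```shell
workdir=$(mktemp -d)                       # scratch directory (hypothetical)
head -c 100 /dev/zero > "$workdir/hundred0s"
gzip "$workdir/hundred0s"                  # leaves hundred0s.gz

# The first two bytes are the gzip magic number.
magic=$(od -An -to1 -N2 "$workdir/hundred0s.gz")
# Skip the fixed 10-byte header (-j10), then read the 9-byte stored name.
name=$(od -An -c -j10 -N9 "$workdir/hundred0s.gz" | tr -d ' \n')
echo "magic:$magic name: $name"

rm -rf "$workdir"
```

Skipping ten bytes lands on the 11th byte, which is where the listing above shows the filename starting.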
If you only need to compare two files of the same length, cmp might do the job:
$ cmp 100zeroes.gz hundred0s.gz
100zeroes.gz hundred0s.gz differ: byte 5, line 1
$ cmp -l 100zeroes.gz hundred0s.gz
 5  63  35
 6 275 276
 9   0   2
11  61 150
12  60 165
13  60 156
...
By default, cmp lists the first byte that’s different: the fifth byte (decimal) in line 1. (“Line” here means a string ending with a newline character, which od -c would show as \n. These files have only one line, or actually no lines, because there’s no newline.)
The cmp option -l (lowercase “L”) shows all differences, byte-by-byte. The first column is the byte number; the second column is the octal byte value in the first file; and the third column lists the corresponding byte in the second file.
Comparing these to the results from od -c, shown earlier, helps make it clear. For instance, byte 11 in the first file is 61 octal, which is the character 1; this is the first character of the stored filename 100zeroes. The corresponding byte in the second file is 150 octal, which is the character h, the first character of hundred0s.
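A tiny experiment makes the three cmp -l columns easy to read. The two six-byte files below are invented for the sketch; they differ only in their third and sixth bytes.

```shell
workdir=$(mktemp -d)                       # scratch directory (hypothetical)
printf 'abcdef' > "$workdir/one"
printf 'abXdeY' > "$workdir/two"

# cmp exits with status 1 when the files differ, so "|| true" keeps a
# script running under "set -e" from stopping here.
cmp -l "$workdir/one" "$workdir/two" || true

# Keep the listing for later use: byte number, then the two octal values.
diffs=$(cmp -l "$workdir/one" "$workdir/two" | awk '{ print $1, $2, $3 }')
printf '%s\n' "$diffs"

rm -rf "$workdir"
```

Byte 3 is c (143 octal) in the first file and X (130 octal) in the second; byte 6 is f (146) against Y (131).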
The man page ascii is a handy reference for this sort of work. The Linux version (on Debian, at least) shows the ASCII characters with their octal, decimal and hex values.
This column almost never covers just one topic in a straight line, and this month is no exception. Although you may never need to make byte-by-byte comparisons, let’s look at another way to do them. The techniques shown next could help you in other cases, too, and with other utilities.
Although most GNU versions of Unix utilities have a lot more features than the originals, there are still jobs they can’t handle. Twenty-five or thirty years ago, when Unix was fairly new, it was common to chain utilities, slicing and dicing the output of one to make input for another. Let’s do that here.
Let’s say you want to compare two files — possibly long ones — byte-by-byte. od can show files byte-by-byte, but comparing two od listings can be tedious, especially if the two files have different lengths. cmp uses a compact format, but if both files aren’t the same length, cmp is even less useful because it can’t track when a series of characters has been inserted or deleted.
$ ls -l file?
-rw-r--r-- 1 jpeek 427 Jan 2 08:39 file1
-rw-r--r-- 1 jpeek 432 Jan 2 08:41 file2
$ cmp file?
file1 file2 differ: char 149, line 1
$ diff file?
Files file1 and file2 differ
$ cmp -lb file?
149   0 ^@ 163 s
151   0 ^@  63 3
. 40 more lines.
427  12 ^J   0 ^@
cmp: EOF on file1
When you’re faced with a problem like this, ask what utilities can do a part of the job. The steps to follow might be:
The GNU od has the -w1 option to show one byte per line. (If it didn’t, you could use a utility like sed to replace the space between each byte with a newline character.)
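As a quick sketch of what -w1 produces, here are three invented sample bytes dumped one per line, first as od prints them and then with the leading offset digits stripped by sed so that only the byte values remain.

```shell
# Three sample bytes: 'a', TAB, 'b' (illustrative data).
printf 'a\tb' | od -c -w1 -v

# Strip the leading offset digits from each line.
stripped=$(printf 'a\tb' | od -c -w1 -v | sed 's/^[0-9]*//')
printf '%s\n' "$stripped"

# One line per byte, plus od's final offset line.
lines=$(printf 'a\tb' | od -c -w1 -v | wc -l)
echo "lines=$lines"
```

Output in this shape, one value per line, is exactly what line-oriented tools like diff are built to compare.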
Let’s see a bash script named bdiff, for “binary diff,” shown in Listing Two. Most of the script is setup, checking the two input files and creating two temporary files with mktemp. Here, let’s just look at the script’s two most important parts. You can download a complete copy of the bdiff script.
#!/bin/bash
myname=${0##*/}    # basename of this script
# 6 spaces, regexp for diff, TAB, and 3 spaces
# (bash expands \t inside $'...' to a TAB):
pattern=$'      [<|>]\t   '

# Set up temp files:
if [ $# -ne 2 -o ! -r "$1" -o ! -r "$2" ]
then
  echo 1>&2 "Usage: $myname file1 file2 (Check: both files exist, readable?)"
  exit 1
fi
tmp1=$(mktemp -t $myname.1.XXXXXXXXX) || exit 1
if tmp2=$(mktemp -t $myname.2.XXXXXXXXX)
then
  stat=1    # default exit status; reset later
  trap 'rm -f $tmp1 $tmp2; exit $stat' 0 1 2 15
else
  rm -f $tmp1
  exit 1
fi

# Run od, show characters, one byte per line,
# show all bytes. Then remove offset values:
od -c -w1 -v "$1" | sed 's/^[0-9]*//' > $tmp1
od -c -w1 -v "$2" | sed 's/^[0-9]*//' > $tmp2

# Side-by-side diff, output width 30 columns:
#diff -y -W 30 --suppress-common-lines $tmp1 $tmp2
diff -y -W 30 $tmp1 $tmp2 | grep -C1 "$pattern"
# Exit with same status as diff. After a pipeline, $? would be
# grep's status, so take diff's from PIPESTATUS instead:
stat=${PIPESTATUS[0]}
The two lines below run od on each file, then use sed to remove the offset value (the leading digits) from each line of od output. The two temporary files contain the character representations of each byte, one byte per line:
od -c -w1 -v "$1" | sed 's/^[0-9]*//' > $tmp1
od -c -w1 -v "$2" | sed 's/^[0-9]*//' > $tmp2
Next, the script runs diff -y (also known as sdiff) to get side-by-side diff format. The normal side-by-side format makes 130-column lines, but these narrow input lines will fit easily in just 30 columns. The GNU option --suppress-common-lines shows only the lines that differ:
# Side-by-side diff, output width 30 columns:
diff -y -W 30 --suppress-common-lines $tmp1 $tmp2
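To try the two steps on something small, here is a minimal end-to-end sketch. The two sample files are invented; the second is the first with one X inserted. The scratch directory and file names are hypothetical, and this is a cut-down illustration rather than the full bdiff script.

```shell
workdir=$(mktemp -d)                       # scratch directory (hypothetical)
printf 'abc\0d'  > "$workdir/file1"
printf 'abXc\0d' > "$workdir/file2"        # same bytes with an X inserted

# One byte per line, offsets removed.
od -c -w1 -v "$workdir/file1" | sed 's/^[0-9]*//' > "$workdir/dump1"
od -c -w1 -v "$workdir/file2" | sed 's/^[0-9]*//' > "$workdir/dump2"

# diff exits with status 1 when it finds differences; "|| true" tolerates that.
changes=$(diff -y -W 30 --suppress-common-lines \
    "$workdir/dump1" "$workdir/dump2" || true)
printf '%s\n' "$changes"

rm -rf "$workdir"
```

Because diff can track insertions, it reports only the inserted X and treats the bytes after it as unchanged, which cmp cannot do.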
Let’s run the script. Here are the six differences:
$ bdiff file1 file2
              >    s
              >    3
  \0          |    1
              >    b
              >    6
              >    )
The results show that an s, a digit 3, a b, a 6, and a right parenthesis were inserted. (On each of those lines, the blank left column and the > in the middle column mean text was inserted into the second file.) Also, a NUL byte (\0) was replaced with a digit 1.
Let’s revise bdiff to show context around the changes. That means removing the diff --suppress-common-lines option. Removing that option gives a lot of output on long files, though. We can’t use the diff -C (“context”) option because it conflicts with -y.
To solve that, let’s add grep -C, which shows context around a match. We’d like to show context around any line where the second column of diff output shows a change: a <, |, or > character.
But what’s in that whitespace around the center column: is it spaces or TABs? Let’s use cat -v -t -e to check the bdiff output. It shows TAB characters as ^I, shows other non-printing characters in a printable way, and adds $ at the end of each line:
$ bdiff file? | cat -v -t -e
^I      >^I   s$
^I      >^I   3$
  \0^I      |^I   1$
^I      >^I   b$
^I      >^I   6$
^I      >^I   )$
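If those markers are new to you, a one-line sketch shows them in isolation. The sample line is invented; it holds a TAB and a DEL character (octal 177).

```shell
# One line holding a TAB and a DEL character (octal 177); sample data only.
shown=$(printf 'a\tb\177\n' | cat -v -t -e)
printf '%s\n' "$shown"
```

The TAB appears as ^I, the DEL as ^?, and the $ marks where the line ends, so trailing whitespace can't hide.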
The center-column character seems always to have six spaces before it, and a TAB and three spaces after it. (You could do more-careful checking, but this’ll do for now.) Let’s change the bdiff script to store a grep pattern in a shell variable at the start of the script, then edit the diff line to remove the --suppress-common-lines option and add grep with one line of context:
# 6 spaces, regexp for diff, TAB, and 3 spaces
# (bash expands \t inside $'...' to a TAB):
pattern=$'      [<|>]\t   '
...
diff -y -W 30 $tmp1 $tmp2 | grep -C1 "$pattern"
Let’s run it:
$ bdiff file?
  \0              \0
              >    s
  \0              \0
              >    3
  \0              \0
  \0          |    1
  \0              \0
--
  \0              \0
              >    b
  \0              \0
              >    6
  \0              \0
--
  \0              \0
              >    )
  \0              \0
The output shows that both files have NUL bytes around the changed areas.
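The -- lines in that output come from grep, not diff. This small sketch, with eight invented input lines, shows how grep -C1 prints one line of context on each side of a match and separates non-adjacent groups.

```shell
# Eight sample lines; HIT marks the two lines we care about.
context=$(printf '%s\n' a HIT b c d e HIT f | grep -C1 'HIT')
printf '%s\n' "$context"
```

Lines c and d are far enough from both matches to be dropped, and grep prints -- where they would have been.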
You can use that method for other problems. For instance, to compare two directory listings, run ls on each directory, strip out the parts of ls output that you don’t want to compare, then run diff. It’s a handy technique to know.
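Here is a minimal sketch of that directory comparison. The directories and file names are invented, and plain ls prints names only, so in this simple case there is nothing to strip before running diff.

```shell
workdir=$(mktemp -d)                       # scratch directory (hypothetical)
mkdir "$workdir/dir1" "$workdir/dir2"
touch "$workdir/dir1/a.txt" "$workdir/dir1/b.txt"
touch "$workdir/dir2/a.txt" "$workdir/dir2/b.txt" "$workdir/dir2/c.txt"

# Save one listing per directory, outside the directories being listed.
ls "$workdir/dir1" > "$workdir/list1"
ls "$workdir/dir2" > "$workdir/list2"

# diff exits with status 1 when the listings differ; "|| true" tolerates that.
listing_diff=$(diff "$workdir/list1" "$workdir/list2" || true)
printf '%s\n' "$listing_diff"

rm -rf "$workdir"
```

With ls -l you would first cut the listing down to the columns you care about, for instance with sed or awk, just as bdiff strips the od offsets.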
Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.