It's (Not) Magic

Unlike operating systems that require an “extension” like .doc or .txt to identify file types and what application should open them, Linux systems generally don’t. Some applications do: C compilers, for instance, expect source filenames ending with .c. But the kernel doesn’t enforce this.

Instead of filename extensions, Linux uses a system of magic numbers: the first few bytes stored in an executable file tell how to run it. There’s no need to name a shell script foo.sh; plain foo is fine. (If you need to know, the command file foo will usually tell you.)

This system of identifying a file type by its content isn’t used only for executable files. It’s common in many other types of files, as we’ll see.

How is this all defined? And, armed with this knowledge, what can you do with it? Let’s dig in.

Checking File Types with file(1)

Over the years, there’ve been several versions of the file utility. We’ll look at the version that uses external data files named magic to characterize files.

When you pass file one or more filenames, it starts by checking each file’s inode. If the file is a directory, symbolic link, or other special file, it will tell you so:

$ file /dev/tty /tmp /vmlinuz
/dev/tty: character special (5/0)
/tmp:     sticky directory
/vmlinuz: symbolic link to `boot/vmlinuz-2.4.26'

Otherwise, file looks into the file, comparing the file contents to patterns in the magic files:

$ file at autoconf autoscan
at:        setuid ELF 32-bit LSB executable,
           Intel 80386, version 1 (SYSV),
           for GNU/Linux 2.0.30, dynamically
           linked (uses shared libs), stripped
autoconf:  Bourne shell script text executable
autoscan:  perl script text executable

The magic(5) manual page describes the magic file format. The manpage is more of a reference than an introduction, so if you haven’t used file before, you might start by looking at the magic files directly. There’s usually a well-commented standard file, supplied with file, in a location like /usr/share/misc/magic. A local file, such as /etc/magic, lets you define other file types.

Digging A Little Deeper: ELF files

Many of the executable files on a Linux system are in Executable and Linking Format (ELF). ELF is similar to, but more flexible than, the older a.out and COFF binary formats. Types of ELF files include executable, shared object, relocatable, and core dump.

All ELF files start with a four-character magic number, as this entry from the magic file shows:

0 string \177ELF ELF

That is (reading from left to right), at an offset of 0 (the first byte in the file), an ELF file has the string \177ELF (octal 177 followed by the characters E, L, and F). When Linux executes a file (as described in the execve(2) manpage), this magic number tells the kernel that the file is in ELF format.

The file utility can tell you some of this, but the more specialized readelf utility can tell you much more about ELF files:

$ readelf -h at
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 .
  Class:    ELF32
  Data:     2's complement, little endian
  Type:     EXEC (Executable file)
  Machine:  Intel 80386
  .

-h shows the ELF header, which always starts at the beginning of the file and includes the ELF magic number that tells Linux this is an ELF file. The Magic data shows the file’s first sixteen bytes in hex format. The first four bytes are always the same: 7f hex, followed by the three characters E, L, and F (45, 4c, and 46 in hex).

To see the printable bytes as characters, use the od utility with its -c (“character”) option, which shows non-text bytes in octal. Let’s add -N 4 to see just the first four bytes:

$ od -N 4 -c at
0000000 177   E   L   F
0000004

As you can see, the first four bytes of at, an ELF file, compose the ELF magic number.
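The same check is easy to script. Here's a minimal sketch of a Bourne-style shell function that asks od for the first four bytes in hex and compares them with the ELF magic number. The name is_elf is our own invention for this example, not a standard utility:

```shell
# is_elf FILE: succeed if FILE starts with the ELF magic number.
# (is_elf is a hypothetical helper name, not a standard command.)
is_elf() {
    # -A n: no offset column; -t x1: one hex byte at a time;
    # -N 4: read only the first four bytes.
    magic=$(od -A n -t x1 -N 4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "7f454c46" ]    # 7f, then E, L, F
}
```

You might call it as is_elf "$somefile" && echo "ELF file". Comparing hex digits avoids any question about how od prints non-text bytes.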

There’s another common magic number, which we’ll look at next.

A Magic Number for Interpreters

Linux can run a script file — interpreted by a shell, or Perl, or almost any other interpreter — without requiring a special filename extension like .sh or .pl. This happens when you start a file with the “magic number” #!.

As the execve(2) manpage explains, when you start a file with the two characters #!, the kernel looks at the rest of that first line for the full pathname of an interpreter program and, possibly, an argument for the interpreter. (Linux passes everything after the interpreter’s pathname as a single argument, so it’s safest to put just one option there.) The kernel tacks on the pathname of the script file, then executes that interpreter program.

Let’s look at a simple script file named makeold:

#!/bin/zsh --extended_glob
# makeold - rename * to OLD-*
for f in ^OLD-*
do
  test -f "$f" && mv -i "$f" "OLD-$f"
done

When you run that script by typing its name — for instance, ./makeold — the command executed is:

/bin/zsh --extended_glob ./makeold

If the script accepted command-line arguments and you had typed ./makeold -a b, the command executed would be:

/bin/zsh --extended_glob ./makeold -a b
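To make the rule concrete, here's a rough sketch that imitates, in a shell function, how that command line is put together. The function name shebang_command is hypothetical, and the sketch only prints the command for display; it doesn't handle details such as whitespace between #! and the interpreter's pathname:

```shell
# shebang_command SCRIPT [ARGS...]: print the command that would run
# if SCRIPT were executed. (Hypothetical helper, for display only.)
shebang_command() {
    script=$1
    shift
    first=
    # Read the first line; cope with a file that has no final newline.
    IFS= read -r first < "$script" || [ -n "$first" ] || return 1
    case $first in
        '#!'*) ;;              # starts with the magic number
        *)     return 1 ;;     # not an interpreter script
    esac
    # ${first#??} strips the leading "#!"; the script's pathname and
    # any arguments are tacked on after the interpreter line.
    if [ $# -gt 0 ]; then
        printf '%s %s %s\n' "${first#??}" "$script" "$*"
    else
        printf '%s %s\n' "${first#??}" "$script"
    fi
}
```

With the makeold script above in the current directory, shebang_command ./makeold -a b prints /bin/zsh --extended_glob ./makeold -a b.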

This #! technique lets you choose any interpreter to read a file. (We’re using the Z shell here because it correctly handles filenames containing spaces and because its extended_glob option lets us expand filenames that don’t match a pattern. You can read more about zsh in the February 2004 article “Catching Some ZZZs.”)

There’s one important catch to the #! magic number, though: the interpreter that reads the file should ignore lines starting with #. (The interpreter simply gets the script filename as an argument, so the first line it reads is the #!/interpreter-path line that execve(2) used to choose the interpreter.) This is usually okay because most interpreters (that is, most scripting languages) treat a line starting with # as a comment.

A FILE THAT SHOWS ITSELF

The magic number #! may seem like a bit of magic, but it’s actually straightforward. To demonstrate this, here’s a two-line script named showme:

#!/bin/cat -n
this is a line in the showme file

(The cat option -n numbers each output line.) When you put those lines in a file and execute it, you’ll see:

$ ./showme
     1  #!/bin/cat -n
     2  this is a line in the showme file

The system is running /bin/cat -n ./showme, so cat reads the file ./showme and shows each line with a line number before it. (cat isn’t an interpreter, but the Linux kernel doesn’t care: it simply runs the “interpreter” and passes the arguments to it.)

As another example, here’s a file named xhelp that displays itself by using the less pager:

#!/usr/bin/less +2
This is the help file for whatever.
To read the next page, press SPACE.
...

The less option +2 tells less to display line 2 at the top of the screen. So, users who execute xhelp won’t see the line #!/usr/bin/less +2 on their screens — unless they tell less to scroll up a line.

More Magic

As we mentioned, many standard files also start with a sequence of bytes to identify the file type. An easy way to find these is by searching through the magic file for the file(1) utility. For example, let’s look at the magic file’s description of a ZIP file. You can use this data to identify files:

0   string   PK\003\004   Zip archive data
>4   byte   0x09   \b, at least v0.9 to extract
>4   byte   0x0a   \b, at least v1.0 to extract
>4   byte   0x0b   \b, at least v1.1 to extract
>4   byte   0x14   \b, at least v2.0 to extract

A Perl script wouldn’t need to run file(1). It could read the first four bytes of a file to confirm that the file is in ZIP format. If the first four bytes aren’t the letter P, the letter K, octal 3, and octal 4, then it’s not a ZIP file. (The bytes after the first four tell what ZIP format is used: the hex byte 0b, for instance, means the file requires ZIP version 1.1.)

It’s not as easy to test non-printable characters in a Bourne-type shell script. One useful technique is to have od read the file and test its output. For instance, a ZIP file should make output like this:

$ od -N 4 -c somefile
0000000   P   K 003 004
0000004

Your script could read and check that data as shown in the following Bourne shell script fragment testing for a ZIP file:

# Translate first four bytes into characters:
decoded=`od -N 4 -c "$inputfile"`

# What $decoded should contain for a ZIP file:
expected="0000000   P   K 003 004
0000004"

if [ "$decoded" != "$expected" ]; then
   echo "myname: $inputfile isn't a ZIP file?" 1>&2
   exit 1
fi

Note that both $decoded and $expected contain an embedded newline (each variable holds two lines of text). This isn’t a problem in Bourne shell scripts if you’re careful with your quoting.
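If the spacing of od -c output worries you, comparing hex digits is a little more robust. The sketch below checks the ZIP magic number and then reads the fifth byte. The function name zip_version is hypothetical, and turning the byte into a version number by dividing by ten is inferred from the magic entries shown above (0x14 is decimal 20, or v2.0):

```shell
# zip_version FILE: print the ZIP version needed to extract FILE;
# fail if FILE doesn't start with the ZIP magic number.
# (zip_version is a hypothetical helper name.)
zip_version() {
    magic=$(od -A n -t x1 -N 4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "504b0304" ] || return 1    # P, K, octal 3, octal 4
    # -j 4 skips four bytes; -t u1 prints the next byte in decimal.
    ver=$(od -A n -t u1 -j 4 -N 1 "$1" | tr -d ' \n')
    [ -n "$ver" ] || return 1                # file too short
    printf '%d.%d\n' $((ver / 10)) $((ver % 10))
}
```

A script could then write something like zip_version "$inputfile" || exit 1 to reject files that aren't ZIP archives.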

MIME Magic, Part 1

A fairly new feature of file is testing a file and outputting its MIME type. The -i option tells file to use the alternate definitions in the magic.mime file. This can come in handy for testing file types from a script. It’s also useful for building MIME messages on the fly, when you need the correct value for a Content-Type: header field.

As an example, let’s see a simple shell script that MIME-encodes a file and emails it. This script is far from “bulletproof”: for instance, it doesn’t check to be sure that file returns a file type. It’s meant as much to show a couple of useful shell techniques as to show a use for file -i. (The -b option tells file to output only the MIME type, not the filename.)

mailer="/usr/sbin/sendmail -t"
mimetype=`file -b -i "$file"`

# Note empty line at end of header:
header="From: Jerry Peek <jpeek@jpeek.com>
To: $to
Subject: The $file file
MIME-Version: 1.0
Content-Type: $mimetype; name=\"$file\"
Content-Disposition: attachment;
  filename=\"$file\"
Content-Transfer-Encoding: base64
"

# Send header then body to mailer's stdin:
(echo "$header"; mimencode "$file") | $mailer

The header variable contains multiple lines. The last line is empty; this makes the required empty line at the end of a mail message header. The contents of $header and the result of mimencode (which base64-encodes $file) are combined using subshell operators; this gives the message header, a blank line, and the message body. The output of the subshell is piped to a mailer program which expects a fully-formed message (header and body both) — here we’re using sendmail.

(We saw a similar technique in the May 2004 column “Great Command-line Combinations.” In that example, we were building the body of an email message with the output of several commands in a subshell. In this case, we’re building both the header and body of an email message.)
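One way to add the missing check is to wrap file -i in a small function that falls back to a generic type. This is only a sketch: the name mimetype_of is hypothetical, and using application/octet-stream as the fallback is our assumption about what a mailer should do with an unidentified file:

```shell
# mimetype_of FILE: print FILE's MIME type as reported by file -b -i,
# or a generic type if file(1) is missing or prints nothing.
# (mimetype_of is a hypothetical helper name.)
mimetype_of() {
    [ -r "$1" ] || return 1            # missing or unreadable file
    mtype=$(file -b -i "$1" 2>/dev/null) || mtype=
    # Assumed fallback: treat unidentified data as arbitrary binary.
    [ -n "$mtype" ] || mtype=application/octet-stream
    printf '%s\n' "$mtype"
}
```

The script above could then set its variable with mimetype=`mimetype_of "$file"` || exit 1.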

MIME Magic, Part 2

Just as you can use file to check a file type before MIME-encoding the file, you can also check a file type after decoding a MIME message. This may be a good idea in any sort of system that extracts or executes a MIME message part from an unknown sender. Spammers and crackers may use incorrect Content-Type: fields and/or filename extensions to sneak misleading content into the recipient’s system. Testing the file’s magic number after extracting it can catch this mischief.

It’s also sometimes possible to test the encoded MIME body part directly. This means you don’t need to decode the MIME part or to run file; you simply need to do a pattern match against (typically) the first line of the MIME part. Here’s an example, the first few lines of a MIME-encoded ZIP archive:

Content-Disposition: attachment;
  filename="file.zip"

UEsDBBQAAAAIALaSXS2iLX6dSwAAAGY...
yVcoSk1MUcjPy6nkSszJTCxW0C0HqgB...
+kq1y7kAUEsDBAoAAAAAAHmSXS0AAAA...

The base64 encoding scheme encodes each three bytes (octets) as four printable ASCII characters: letters, digits, +, and /. This is a linear encoding, so you know that the first three bytes of the file will be represented by the first four characters. In the example above, the first three bytes of the magic number (P, K, and octal 3) are encoded as UEsD. So, if the first line of the body part doesn’t start with UEsD, this isn’t a ZIP file!

Whether this scheme works for you depends on the length of the magic number. If checking three bytes is enough, you’re all set. If you want to test all four bytes, though, you need to check the first six characters of the encoded body part. (The base64 encoding scheme breaks each 24 input bits into four 6-bit groups. Each of those groups is translated into a single base64-encoded character. So the first input byte in a group of three bytes is represented by the first two encoded characters.) The fourth byte of the file starts a new group, so it’s spread across the fifth and sixth encoded characters. The sixth character will vary, of course, depending on what the fifth byte of the original file contains.
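Here's a sketch of that four-byte test as a shell pattern match. The function name b64_is_zip is hypothetical. The reasoning behind the pattern: the fifth character is always B (the top six bits of octal 4), and the sixth character holds the last two bits of octal 4 plus four bits of the fifth byte, so it must be one of the first sixteen base64 characters, A through P:

```shell
# b64_is_zip LINE: succeed if LINE (the first line of a base64 body
# part) could be the start of an encoded ZIP file.
# (b64_is_zip is a hypothetical helper name.)
# UEsDB covers P, K, octal 3 and most of octal 4; the sixth character
# is listed explicitly so the match doesn't depend on locale ranges.
b64_is_zip() {
    case $1 in
        UEsDB[ABCDEFGHIJKLMNOP]*) return 0 ;;
        *)                        return 1 ;;
    esac
}
```

Both UEsDBBQA... and UEsDBAoA... from the example message pass this test; a line starting with yVco does not.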

You can look up the base64 encoding scheme in RFC 2045, which replaced RFC 1521. A quicker way, though, may be to encode the bytes and see what you get. Let’s send the first 3, 4, 5, and 6 bytes to mimencode:

$ head -c3 file.zip | mimencode
UEsD
$ head -c4 file.zip | mimencode
UEsDBA==
$ head -c5 file.zip | mimencode
UEsDBBQ=
$ head -c6 file.zip | mimencode
UEsDBBQA

(An = at the end of the string is a padding character.) Is this technique worthwhile? That’s up to you to decide. On a busy mail server, doing a pattern match on one line, instead of decoding the entire body part, can save time.
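mimencode comes from the metamail package, which isn't installed everywhere. If you don't have it, the experiment can be repeated with the base64 command instead; this sketch assumes base64 (shipped with GNU coreutils, among others) and head -c are available, and the function name b64_prefix is hypothetical:

```shell
# b64_prefix FILE N: print the base64 encoding of FILE's first N bytes.
# Assumes the base64 command and head -c are available.
# (b64_prefix is a hypothetical helper name.)
b64_prefix() {
    head -c "$2" "$1" | base64
}
```

For a ZIP file like the one above, b64_prefix file.zip 4 prints UEsDBA==, matching the mimencode output.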

Summary

Magic numbers are used in a lot of situations. They — or other identifying strings buried in a file — can tell you a lot about the file contents. Reading through the magic configuration file for the file utility can show you a lot of this magic. So can a utility like od with the correct option to represent file contents as characters, octal, or hex.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for nearly 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.