The Unknown Power Tool: m4, Part Two

Dig deeper into m4, and look at included files, diversions, frozen files, and debugging and tracing.

Last month’s column presented some basics of the m4 macro processor. m4 scans input text for defined symbols — macros — and replaces those symbols with other text and possibly with other symbols. For example, m4 can convert one language into another.

This month, let’s dig deeper into m4 and look at included files, diversions, frozen files, and debugging and tracing. Along the way, we’ll see some of the rough edges of m4’s minimalist language and explore workarounds. Before we start, though, here’s a warning from the GNU m4 info page:

Some people [find] m4 to be fairly addictive. They first use m4 for simple problems, then take bigger and bigger challenges, learning how to write complex m4 sets of macros along the way. Once really addicted, users pursue writing of sophisticated m4 applications even to solve simple problems, devoting more time debugging their m4 scripts than doing real work. Beware that m4 may be dangerous for the health of compulsive programmers.

So take a deep breath. Good. Now let’s dig in again!

Included Files

m4’s built-in include() macro takes m4’s input from a named file until the end of that file, when the previous input resumes. sinclude() works like include() except that it won’t complain if the included file doesn’t exist.

If an included file isn’t in the current directory, GNU m4 searches the directories specified with the -I command-line option, followed by any directories in the colon-separated M4PATH environment variable.

Including files is often used to read in other m4 code, but can also be used to read plain text files. However, if you’re reading plain text files, watch out for files that contain text that can confuse m4, such as quotes, commas, and parentheses. One way to work around that problem and read the contents of an arbitrary file is to use changequote() to temporarily override the quoting characters and to replace include() with esyscmd(), which filters the file through a Linux utility like tr or sed.

Listing One has a contrived example that shows one way to read /etc/hosts, replacing parentheses with square brackets and commas with dashes.

Listing One: A filtering version of the m4 include() macro
% cat readfile.m4
dnl readfile: display file named on
dnl command line in -Dfile=
dnl converting () to [] and , to -
dnl
file `file on 'esyscmd(`hostname')
changequote({,})dnl
esyscmd({tr '(),' '[]-' <} file)dnl
That's all.
changequote
% cat /etc/hosts
127.0.0.1 localhost
216.123.4.56 foo.bar.com foo
# Following lines are for `IPv6'
# (added automatically, we hope)
::1 ip6-localhost ip6-loopback
...
% m4 -Dfile=/etc/hosts readfile.m4
/etc/hosts file on foo
127.0.0.1 localhost
216.123.4.56 foo.bar.com foo
# Following lines are for `IPv6'
# [added automatically- we hope]
::1 ip6-localhost ip6-loopback
...
That's all.

The option -D or --define lets you define a macro from the command line, before any input files are read. (Later, we’ll see a cleaner way to read text from arbitrary files with GNU m4’s undivert().)

Diversions: An Overview

Normally, all output is written directly to m4’s standard output. But you can use the divert() macro to collect output into temporary storage places. This is one of m4’s handiest features.

The argument to divert() is typically a stream number, the ID of the diversion that should get the output from now on.

*Diversion 0 is the default. Text written to diversion 0 goes to m4’s standard output. If you’ve been diverting text to another stream, you can call divert(0) or just divert to resume normal output.

*Text written to diversions 1, 2, and so on is held until m4 exits or until you call undivert(). (More about that in a moment.)

*Any text written to diversion -1 isn’t emitted. Instead, diversion -1 is “nowhere,” like the Linux pseudo-file /dev/null. It’s often used to comment code and to define macros without using the pesky dnl macro at the ends of lines.

*The divnum macro outputs the current diversion number.

Standard m4 supports diversions 1 through 9, while GNU m4 can handle an essentially unlimited number of diversions. GNU m4 holds diverted text in memory until it runs out of memory and then moves the largest chunks of data to temporary files. (So, in theory, the number of diversions in GNU m4 is limited to the number of available file descriptors.)

All diversions (1, 2, ...) are output at the end of processing, in ascending order of stream number. To output diverted text sooner, simply call undivert() with the stream number. undivert() outputs text from a diversion and then empties the diversion. So, immediately calling undivert() again on the same diversion outputs nothing.

“Undiverted” text is output to the current diversion, which isn’t always the standard output! You can use this to move text from one diversion to another. Output from a diversion is not rescanned for macros.

Diverse Diversions

Before looking at the more-obvious uses of numbered diversions, let’s look at a few surprising ones.

As was mentioned, diversion -1 discards output. One of the most irritating kinds of m4 output is the newline character left behind after each macro definition. You can stop them by calling dnl after each define, but you can also stop them by defining macros after a call to divert(-1).

Here are two examples. This first example, nl, doesn’t suppress the newline from define.

`The result is:'
define(`name', `value')
name

... but the next example, nonl, does, by defining the macro inside a diversion:

`The result is:'
divert(-1)
define(`name', `value')
divert(0)dnl
name

Let’s compare the nl and nonl versions.

$ m4 nl
The result is:

value
$ m4 nonl
The result is:
value

The second divert() ends with dnl, which eats the following newline. Adding the argument (0), which is actually the default, lets you write dnl without a space before it (which would otherwise be output). You can use divert`'dnl instead, because an empty quoted string (`') is another way to separate the divert and dnl macro calls.

Of course, that trick is more reasonably done around a group of several defines. You can also write comments inside the same kind of diversion. This is an easy way to write blocks of comments without putting dnl at the start of each line. Just remember that macros are recognized inside the diversion (even though they don’t make output). So, the following code increments i twice:

divert(-1)
Now we run define(`i', incr(i)):
define(`i', incr(i))
divert`'dnl

dnl can start comments, and that works on even the oldest versions of m4. Generally, # is also a comment character. If you put it at the start of the comment line above, as in #Now..., then that line won’t increment i.

Before seeing the “obvious” uses of diversions, here’s one last item from the bag of diversion tricks. GNU m4 lets you output a file’s contents by calling undivert() instead of include(). The advantage is that, like undiverting a diversion, “undiverting” a file doesn’t scan the file’s contents for macros. This lets you avoid the really ugly workaround shown in Listing One. With GNU m4, you could have written simply:

undivert(`/etc/hosts')

Diversions as Diversions

The previous section showed some offbeat uses of divert(). Now let’s see a more obvious use: splitting output into parts and reassembling those parts in a different order.

Listings Two, Three, and Four show an HTML generator that outputs the text of each top-level heading in two places: in a table of contents at the start of the web page, and again, later, in the body of the web page. The table of contents includes links to the actual headings later in the document, which will have an anchor (an HTML id).

Listing Two has the file, htmltext.m4, with the macro calls. Listing Three shows the HTML output from the macros (which omits the blank lines, because HTML parsers ignore them). Listing Four shows the macros, which call include() to bring in the htmltext.m4 file at the proper place. (Blank lines have been added to the macros to make the start and end of each macro more obvious.)

Listing Two: m4 macro calls: the htmltext.m4 file
_h1(`First heading')
_p(`The first paragraph.')
_h1(`Second heading')
_p(`The second paragraph.
Yadda yadda yadda')
_h1(`Third heading')
_p(`The third paragraph.')

Listing Three: HTML output from the htmltext.m4 file
<strong>Table of contents:</strong>
<ol>
<li><a href="#H1_1">First heading</a></li>
<li><a href="#H1_2">Second heading</a></li>
<li><a href="#H1_3">Third heading</a></li>
</ol>
<h1 id="H1_1">First heading</h1>
<p>
The first paragraph.
</p>
<h1 id="H1_2">Second heading</h1>
<p>
The second paragraph.
Yadda yadda yadda
</p>
<h1 id="H1_3">Third heading</h1>
<p>
The third paragraph.
</p>

Listing Four: The m4 code that makes the HTML in Listing Three
define(`_h1count', 0)
define(`_h1', `divert(9)
define(`_h1count', incr(_h1count))
<li><a href="`#'H1`_'_h1count">$1</a></li>
divert(1)
<h1 id="H1`_'_h1count">$1</h1>
divert')
define(`_p', `divert(1)
<p>
$1
</p>
divert')
include(`htmltext.m4')
<strong>Table of contents:</strong>
<ol>
undivert(9)
</ol>
undivert(1)

Let’s look at the code in Listing Four.

*The _h1count macro sets the number used at the end of each HTML id. It’s incremented by a define call inside the _h1 macro.

*The _h1 (heading level 1) macro starts by calling divert(9). The code uses diversion 9 to store the HTML for the table of contents. After incrementing _h1count, the macro outputs a list item surrounded by <li> and </li> tags. (The <ol> tags come later: when the code undiverts diversion 9.) Notice that the # is quoted to keep it from being treated as an m4 comment character. In the same way, the underscore is quoted (`_'), since it’s used as part of the HTML id string (for instance, href="#H1_2").

A final call to divert switches output back to the normal diversion 0, which is m4’s standard output.

*The _p (paragraph) macro is straightforward. It stores a pair of <p> tags with the first macro argument in-between in diversion 1.

*A call to include() brings in the file htmltext.m4 (Listing Two). This could have been done in several other ways: on the m4 command line, for instance.

*Finally, the call undivert(9) outputs the table of contents surrounded by a pair of ordered-list tags, followed by the headers and paragraphs from undivert(1).

This example shows one use of diversions: to output text in more than one way. Another common use — in sendmail, for instance — is gathering various text into “bunches” by its type or purpose.

Frozen Files

Large m4 applications can take time to load. GNU m4 supports a feature called frozen files that speeds up loading of common base files. For instance, if your common definitions are stored in a file named common.m4, you can pre-process that file to create a frozen file containing the m4 state information:

$ m4 -F common.m4f common.m4

Then, instead of using m4 common.m4, you use m4 -R common.m4f for faster access to the common definitions.

Frozen files work in a majority of cases, but there are gotchas. Be sure to read the m4 info file (type info m4) before you use this feature.

Debugging and Tracing

m4’s recursion and quoting can make debugging a challenge. A thorough understanding of m4 helps, of course, and the references listed in the next section are worth studying. Here are some built-in debugging techniques:

*To see a macro definition, use dumpdef(), which was covered last month. dumpdef() shows you what’s left after the initial layer of quoting is stripped off of a macro definition and any substitutions are made.

*The traceon() macro traces the execution of the macros you name as arguments, or, without a list of macros, it traces all macros. The trace output shows the depth of expansion, which is typically 1, but can be greater if a macro contains macro calls. Use traceoff to stop tracing.

*The debugmode() macro gives you a lot of control over debugging output. It accepts a string of flags, which are described in the m4 info file. You can also specify debugging flags on the command line with -d or --debug. These flags also affect the output of dumpdef() and traceon().

More about m4

Last month and this, you’ve seen some highlights of m4. If you have the GNU version of m4, its info page (info m4) is a good place to learn more.

*R.K. Owen’s quoting page (http://owen.sj.ca.us/rkowen/howto/webpaging/m4tipsquote.html) has lots of tips about (what else?) quoting in m4. His site also has other m4 information and examples.

*Ken Turner’s technical report “CSM-126: Exploiting the m4 Macro Language,” available from http://www.cs.stir.ac.uk/research/publications/techreps/previous.html, shows a number of m4 techniques.

*A Google search for m4 macro turns up a variety of references. To find example code, try a search with an m4-specific macro name, like m4 dnl and m4 divert -motorway. (In Google, the -motorway avoids matches of the British road named the M4. You can also add -sendmail to skip sendmail-specific information.)

*Mailing lists about m4 are at http://savannah.gnu.org/mail/?group=m4.

Happy m4 hacking!

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for over 20 years. He’s happy to hear from readers; see https://www.jpeek.com/contact.html.
