This month, as another part of the series about using text on Linux systems, we’ll introduce “plain text” and how you can restructure it. We’ll see how to identify text from different systems (Unix, DOS, Mac) and how to convert text between systems. The article ends with some examples, and there’ll be lots more next month.
If you’re used to clicking on files to view and edit them, you’ll probably find some new tools and concepts here. Gurus, please have a look at the main example and be sure it’s familiar.
The fundamental concept is the role of the newline (line feed) character. Reformatting text is basically a matter of juggling newlines. Let’s dig in.
As I wrote in last month’s column, Linux runs on text. Text comes in a lot of flavors. What we’ll cover this month is plain text: a stream of characters that you can output directly to a terminal window using a utility like cat(1). The text doesn’t have special formatting codes like “start boldface” or “24 pixels high” that are only understood by certain operating systems or applications. Plain text doesn’t require a word processing program like OpenOffice.org Writer to interpret instructions buried before and between the actual text.
Let’s make a test text file. We’ll use it to demonstrate a lot of things about text. Although the example is a bit tedious, the techniques will be useful later, when you need to know what’s in a text file. We’ll make this text file on a Windows system, and make a similar file later under Linux.
On Microsoft Windows systems, each line of a plain text file ends with a carriage return (CR) character followed by a newline (LF, line feed) character. Let’s use a “Command prompt” window to copy text typed from the keyboard (con:) into a Windows-format file named win.txt. On DOS-type systems, pressing CTRL-Z followed by the ENTER key ends input. The boldfaced text is user input; the rest is system output:
D:\tmp>copy con: win.txt
        line 1
line 2^Z
        1 file(s) copied.

D:\tmp>
Next, let’s read that file from a Linux-type terminal window — for instance, under Cygwin or from a Windows filesystem cross-mounted onto a Linux host. We’ll look at the file with cat, cat -tve, and od -c; and count the number of lines, words, and characters using wc(1). (For an introduction, see the section “od and friends” of Wizard Boot Camp, Part 10.)
/d/tmp$ cat win.txt
        line 1
line 2/d/tmp$
/d/tmp$ cat -tve win.txt
^Iline 1 ^M$
line 2/d/tmp$
/d/tmp$ od -c -w6 win.txt
0000000  \t   l   i   n   e
0000006   1      \r  \n   l   i
0000014   n   e       2
0000020
/d/tmp$ wc win.txt
 1  4 16 win.txt
/d/tmp$
Here are details. (You might open another copy of this page in a separate window so you can see the example while reading the details.)
Why does the next shell prompt appear on the same line as line 2? It’s because (as we’ll see below) the contents of win.txt don’t end with a newline character. The shell prints its prompt immediately after the output of a command (in this case, the cat utility) finishes. There’s no newline at the end of the file, so the shell prompt appears on the same line as the last text from the file.
Each line of a DOS text file ends with two characters: a carriage return and a newline (line feed). cat -v makes the carriage return visible as ^M, and the -e option marks each newline with a $ at the end of the line.
od shows the structure of a text file. The newline character — the end of the first “line” in the file — is just a character. The bytes of the next “line” start immediately after the newline. (As we’ll see later, you can insert newline characters anywhere you want to start new lines.)
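Since a newline is just a character, any utility that can emit characters can start new lines. As a quick (made-up) illustration, tr can translate every space in its input into a newline, splitting one line into several:

```shell
# Translate each space into a newline, breaking one line into three:
echo "one two three" | tr ' ' '\n'
```

Each space becomes a line break, so the three words come out on three separate lines.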
The previous section showed a text file from a Windows system. Each complete line in a Windows file ends with CRLF (carriage return and line feed characters).
On Unix and Linux systems, each line of a text file ends with a newline (LF, line feed) character. To see that, we’ll repeat the previous example on a Linux box. Let’s use bash’s built-in echo utility with its -e and -n options this time. (If you aren’t using the bash shell, try typing /bin/echo instead of echo.) The -e option tells echo to convert escape sequences into characters; for instance, when echo reads \t, it echoes a TAB character. The -n option suppresses the final newline character:
/tmp$ echo -en "\tline 1 \nline 2" > lin
/tmp$ cat lin
        line 1
line 2/tmp$ cat -tve lin
^Iline 1 $
line 2/tmp$ od -c -w6 lin
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
/tmp$ wc lin
 1  4 15 lin
/tmp$
The difference between the previous Windows example and this one is that there’s no CR character here.
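A quick way to tell the two formats apart, without dumping bytes, is the file(1) utility, which reports the line-terminator style it finds. (The exact wording can vary between versions of file; the filename here is made up for the example.)

```shell
# Make a CRLF-terminated file, then ask file(1) what it sees:
printf 'line 1\r\nline 2\r\n' > crlf.txt
file crlf.txt
```

On a typical GNU/Linux system, file reports something like “ASCII text, with CRLF line terminators”.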
What about other OSes? Old Macintosh files use CR as a line terminator. New Macs, with Unix underneath, use the Unix LF terminator. (If you’d like to read more, here’s a Wikipedia article about “Newline”.)
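To convert an old-style Macintosh file, translate each CR into an LF. tr can do that too (the filenames here are placeholders, as in the example below):

```shell
# Old Macs ended each line with a lone CR (octal 15);
# translate every CR into an LF (octal 12):
tr '\015' '\012' < mac_file > linux_file
```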
Utilities like dos2unix, unix2dos, fromdos and todos add or remove a CR before each LF, as needed. Here’s another way to remove all CR (octal 15) characters from a file:
/tmp$ tr -d '\015' < dos_file > linux_file
If a line happens to have a carriage return in the middle of it, that tr method will strip those too. (For instance, to boldface a line on typewriter-like printers, make the printer’s carriage move to the start of the line and reprint the line several times, as in line^Mline^Mline.) To remove CR only at the end of a line, try sed; use \r in a regular expression to match CR and $ to match end-of-line:
/d/tmp$ sed 's/\r$//' win.txt | od -c -w6
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
A TAB is a single character. On typewriters (remember them?) and printers with a moving head, a TAB moves the current position to the next tabstop location. Tabstops are typically 8 characters apart. For example, let’s say that the first position on a line is position 0, the second is position 1, and so on. So, if the cursor, printer head, etc., is in positions 0 through 7, a TAB will move to position 8. If the current position is in 8 through 15, a TAB moves to 16. We can see this using echo to output two lines. The X in the second line appears underneath the 8 in the first line, which is the next tabstop:
/tmp$ echo -e "0123456789\n..\tX"
0123456789
..      X
These days, TABs are more useful as field separators. For instance, if you save a spreadsheet to a plain-text file, one row per line, with a TAB between each cell from a row, you can use powerful Linux text utilities to extract or rearrange the data. We’ll see some examples next month.
Once you know how text files are structured, you can rearrange the newlines. Want to doublespace some text — that is, add a blank line after every line? At the end of each line, add another newline. Let’s do it with sed. First, make a two-line test file:
/tmp$ echo -e "Line A\nLine B" > sed.in
/tmp$ cat sed.in
Line A
Line B
/tmp$ sed 's/$/\
> /' sed.in > sed.out
/tmp$ cat -e sed.out
Line A$
$
Line B$
$
(When you type a multiline command at a shell prompt, bash uses its secondary prompt > until the command is complete.) cat -e shows that sed.out has an empty line after each line.
What argument is sed actually getting? To find out, use echo -En to write the same argument to od -c. Using -E makes sure echo doesn’t interpret the backslash:
/tmp$ echo -En 's/$/\
> /' | od -c
0000000   s   /   $   /   \  \n   /
0000007
od shows that, after the backslash at the end of the first line (which is required so sed knows that its s command hasn’t finished), sed gets a newline (LF) character on the replacement side of the s command. So, at each existing newline, sed is adding another newline.
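By the way, sed has a shortcut for doublespacing: the single command G appends sed’s hold space (which starts out empty) plus a newline to every line. The effect is the same blank line after each line:

```shell
# G appends a newline plus the (empty) hold space after every input line,
# so two input lines become four output lines:
printf 'Line A\nLine B\n' | sed G
```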
How can you break an input with many lines into “chunks” of, say, 56 lines each? The pr(1) utility does this; it outputs a file or standard input in 56-line chunks, with a 5-line header before each chunk and a 5-line footer after each chunk. (This fits printers that print 66 lines per 11-inch page.) If you have any long set of data, a technique like this can cut it into more-manageable pieces.
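For example, feeding 120 lines to pr yields three pages, because each 66-line page holds 56 lines of text (those are GNU pr’s defaults; check your version’s manual page). Counting the page headers shows the page count:

```shell
# 120 lines fill three 56-line page bodies; each page header
# contains the word "Page", so count the headers:
seq 120 | pr | grep -c 'Page'
```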
A shell loop with redirected input can do the same thing as pr, and the technique is worth knowing when you need chunks of text. Listing One shows an example. The files named on the chunk56 command line — or the standard input, if no files are named — are written to the loop’s standard input. Each call to head -56 reads the next 56 lines of the loop’s standard input, which the shell stores in $chunk. Three echo commands output a 5-line header, the body, and a 5-line footer. Once $chunk is empty, the endless loop (driven by the shell’s colon operator) is broken.
Listing One: chunk56: outputs 66-line pages
#!/bin/bash
page=1
cat "$@" |
while :
do
    chunk=$(head -56)
    if [[ -n $chunk ]]
    then
        echo -e "\n\nPAGE $page\n\n"
        echo -E "$chunk"
        echo -e "\n\n\n\n"
        let page=page+1
    else
        break
    fi
done
Two tools for doing similar “chunking” — but writing the input to individual files instead of to standard output — are split(1) and csplit(1). As a quick example, here’s how you could split the output of someprog into a subdirectory named chunks, in 500-line files named chk.aa, chk.ab, and so on, then process each chunk:
$ mkdir chunks
$ someprog | split -l 500 - chunks/chk.
$ cd chunks
$ for file in chk.*
> do
>     ...
> done
$ rm chk.*
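csplit, on the other hand, splits where the input matches a pattern instead of after a fixed number of lines. Here’s a sketch (the sample file and its PAGE markers are invented for the example), using GNU csplit options:

```shell
# Make a sample file with three sections, each starting with "PAGE".
printf 'PAGE 1\naaa\nPAGE 2\nbbb\nPAGE 3\nccc\n' > report.txt
# Split before each line matching /^PAGE/; '{*}' repeats for every match.
# -z drops the empty piece before the first match; -s silences size output.
csplit -sz report.txt '/^PAGE/' '{*}'
# The pieces are in files named xx00, xx01, xx02:
ls xx*
```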
Now that we’ve got the basics down, you’re ready to slice and dice text. We’ll do a lot of that in the next segment of this series.
Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see http://www.jpeek.com/contact.html.