Linux Runs on Text: Understanding & Handling Text

This month, as another part of the series about using text on Linux systems, we’ll introduce “plain text” and how you can restructure it. We’ll see how to identify text from different systems (Unix, DOS, Mac) and to convert text between systems. The article ends with some examples, and there’ll be lots more next month.

If you’re used to clicking on files to view and edit them, you’ll probably find some new tools and concepts here. Gurus, please have a look at the main example and be sure it’s familiar.

The fundamental concept is the role of the newline (line feed) character. Reformatting text is basically a matter of juggling newlines. Let’s dig in.

What’s Text?

As I wrote in last month’s column, Linux runs on text. Text comes in a lot of flavors. What we’ll cover this month is plain text: a stream of characters that you can output directly to a terminal window using a utility like cat(1). The text doesn’t have special formatting codes like “start boldface” or “24 pixels high” that are only understood by certain operating systems or applications. Plain text doesn’t require a word processing program like OpenOffice.org Writer to interpret instructions buried before and between the actual text.

Let’s make a test text file. We’ll use it to demonstrate a lot of things about text. Although the example is a bit tedious, the techniques will be useful, later, when you need to know what’s in a text file. We’ll make this text file on a Windows system, and make a similar file later under Linux.

On Microsoft Windows systems, each line of a plain text file ends with a carriage return (CR) character followed by a newline (LF, line feed) character. Let’s use a “Command prompt” window to copy text typed from the keyboard (con:) into a Windows-format file named win.txt. On DOS-type systems, pressing CTRL-Z followed by the ENTER key ends input. The boldfaced text is user input; the rest is system output:

D:\tmp>copy con: win.txt
        line 1
line 2^Z
        1 file(s) copied.

D:\tmp>

Next, let’s read that file from a Linux-type terminal window — for instance, under Cygwin or from a Windows filesystem cross-mounted onto a Linux host. We’ll look at the file with cat, cat -tve, and od -c; and count the number of lines, words, and characters using wc(1). (For an introduction, see the section “od and friends” of Wizard Boot Camp, Part 10.)

/d/tmp$ cat win.txt
        line 1
line 2/d/tmp$
/d/tmp$ cat -tve win.txt
^Iline 1 ^M$
line 2/d/tmp$
/d/tmp$ od -c -w6 win.txt
0000000  \t   l   i   n   e
0000006   1      \r  \n   l   i
0000014   n   e       2
0000020
/d/tmp$ wc win.txt
 1  4 16 win.txt
/d/tmp$

Here are details. (You might open another copy of this page in a separate window so you can see the example while reading the details.)

The first command, cat win.txt, shows a file that looks like the text we entered in the DOS window. However, the bash shell prompt, /d/tmp$, comes just after the text line 2 from the file — instead of on a new line by itself.

Why? It’s because (as we’ll see below) the contents of win.txt don’t end with a newline character. The shell always prints a prompt immediately after the output of a command (in this case, the cat utility) finishes. There’s no newline at the end of the file, so the shell prompt appears on the same line.
The second command shows a lot more:
1. The option -t tells cat to show TAB characters as ^I, so you can see that the indentation before line 1 is caused by a TAB.
2. The option -v tells cat to show “nonprinting” characters visibly, which lets us see that there’s a carriage return character, shown as ^M, after a space character, following the text line 1.
  
  Each line of a DOS text file ends with two characters: a carriage return and a newline (line feed). After showing the carriage return visibly, cat output the newline preceded by a $ character:
3. The option -e tells cat to mark the end of a line with $. This lets you see just where a newline falls.
The third command, od -c, shows the character representation of bytes one-by-one. The -w6 option lists six bytes per line. Each line starts with the octal offset from the start of the file. You can see:
1. The first six bytes (at offsets 0000000 through 0000005) are a TAB character (which od shows as \t), the word line and a space character.
2. The second six bytes (from offset 0000006) are the digit 1, a space character, a carriage return character (which od shows as \r), a newline character (which od shows as \n), and the first two characters of the next line of the file, li.
  
  od shows the structure of a text file. The newline character — the end of the first “line” in the file — is just a character. The bytes of the next “line” start immediately after the newline. (As we’ll see later, you can insert newline characters anywhere you want to start new lines.)
3. The last four bytes (octal offsets 0000014 through 0000017) are the letters n, e, a space, and the digit 2. There’s no carriage return, no newline. That’s because, while making the file, we typed the DOS end-of-input character CTRL-Z before pressing RETURN (ENTER) to end the line.
The wc utility reports 1 line, 4 words, and 16 characters.
1. Because there’s only one newline character, there’s only one “line”. (The second line isn’t complete.)
2. There are four words: line, 1, line, and 2.
3. The 16 characters include the carriage return and the newline. (You can see them in the od output, and see the the final offset — 0000020 octal, which shows the number of bytes read — is 16 decimal.) Although the TAB makes a lot of whitespace (it moves the cursor to the next “tab stop” on the terminal, as we’ll see below), it’s only a single character.

Types of text files

The previous section showed a text file from a Windows system. Each complete line in a Windows file ends with CRLF (carriage return and line feed characters).

On Unix and Linux systems, each line of a text file ends with a newline (LF, line feed) character. To see that, we’ll repeat the previous example on a Linux box. Let’s use the built-in bash utility echo -en this time. (If you aren’t using the bash shell, try typing /bin/echo instead of echo.) The -e option tells echo to convert escape sequences into characters; for instance, when echo reads \t, it echoes a TAB character. The -n option suppresses the final newline character:

/tmp$ echo -en "\tline 1 \nline 2" > lin
/tmp$ cat lin
        line 1
line 2/tmp$ cat -tve lin
^Iline 1 $
line 2/tmp$ od -c -w6 lin
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017
/tmp$ wc lin
 1  4 15 lin
/tmp$

The difference between the previous Windows example and this one is that there’s no CR character here.

What about other OSes? Old Macintosh files use CR as a line terminator. New Macs, with Unix underneath, use the Unix LF terminator. (If you’d like to read more, here’s a Wikipedia article about “Newline”.)

Converting text files

Utilities like dos2unix, unix2dos, fromdos and todos add or remove a CR before each LF, as needed. Here’s another way to remove all CR (octal 15) characters from a file:

/tmp$ tr -d '\015' < dos_file > linux_file

If a line happens to have a carriage return in the middle of it, that tr method will strip those too. (For instance, to boldface a line on typewriter-like printers, make the printer’s carriage move to the start of the line and reprint the line several times, as in line^Mline^Mline.) To remove CR only at the end of a line, try sed; use \r in a regular expression to match CR and $ to match end-of-line:

/d/tmp$ sed 's/\r$//' win.txt | od -c -w6
0000000  \t   l   i   n   e
0000006   1      \n   l   i   n
0000014   e       2
0000017

Tabs

A TAB is a single character. On typewriters (remember them?) and printers with a moving head, a TAB moves the current position to the next tabstop location. Tabstops are typically 8 characters apart. For example, let’s say that the first position on a line is position 0, the second is position 1, and so on. So, if the cursor, printer head, etc., is in positions 0 through 7, a TAB will move to position 8. If the current position is in 8 through 15, a TAB moves to 16. We can see this using echo to output two lines. The X in the second line appears underneath the 8 in the first line, which is the next tabstop:

/tmp$ echo -e "0123456789\n..\tX"
0123456789
..      X

These days, TABs are more useful as field separators. For instance, if you save a spreadsheet to a plain-text file, one row per line, with a TAB between each cell from a row, you can use powerful Linux text utilities to extract or rearrange the data. We’ll see some examples next month.

Doublespacing

Once you know how text files are structured, you can rearrange the newlines. Want to doublespace some text — that is, add a blank line after every line? At the end of each line, add another newline. Let’s do it with sed. First, make a two-line test file:

/tmp$ echo -e "Line A\nLine B" > sed.in
/tmp$ cat sed.in
Line A
Line B
/tmp$ sed 's/$/\
> /' sed.in > sed.out
/tmp$ cat -e sed.out
Line A$
$
Line B$
$

(When you type a multiline command at a shell prompt, bash uses its secondary prompt > until the command is complete.) cat -e shows that sed.out has an empty line after each line.

What argument is sed actually getting? To find out, use echo -En to write the same argument to od -c. Using -E makes sure echo doesn’t interpret the backslash:

/tmp$ echo -En 's/$/\
> /' | od -c
0000000   s   /   $   /   \  \n   /
0000007

od shows that, after the backslash at the end of the first line (which is required so sed knows that its s command hasn’t finished), sed gets a newline (LF) character on the replacement side of the s command. So, at each existing newline, sed is adding another newline.

“Chunking” lines

How can you break an input with many lines into “chunks” of, say, 56 lines each? The pr(1) utility does this; it outputs a file or standard input in 56-line chunks, with a 5-line header before each chunk and a five-line footer after each chunk. (This fits printers that print 66 lines per 11-inch page.) If you have any long set of data, a technique like this can cut it into more-manageable pieces.

A shell loop with redirected input can do the same thing as pr, and the technique is worth knowing when you need chunks of text. Listing One shows an example. The files named on the chunk56 command line — or the standard input, if no files are named — are written to the loop’s standard input. Each call to head -56 reads the next 56 lines of the loop’s standard input, which the shell stores in $chunk. Three echo commands output a 5-line header, the body, and a 5-line footer. Once $chunk is empty, the endless loop (driven by the shell’s colon operator) is broken.

Listing One: chunk56: outputs 66-line pages

#!/bin/bash
page=1
cat "$@" |
while :
do
  chunk=$(head -56)
  if [[ -n $chunk ]]
  then
    echo -e "\n\nPAGE $page\n\n"
    echo -E "$chunk"
    echo -e "\n\n\n\n"
    let page=page+1
  else
    break
  fi
done

Two tools for doing similar “chunking” — but writing the input to individual files instead of to standard output — are split(1) and csplit(1). As a quick example, here’s how you could split the output of someprog into a subdirectory named chunks, in 500-line files named chk.aa, chk.ab, and so on, then process each chunk:

$ mkdir chunks
$ someprog | split -l 500 chunks/chk.
$ cd chunks
$ for file in chk.*
> do
>   ...
> done
$ rm chk.*

(Much) more next month...

Now that we’ve got the basics down, you’re ready to slice and dice text. We’ll do a lot of that in the next segment of this series.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.