Filename Trouble

Over the years, Linux and Unix cognoscenti haven’t used spaces in file and directory names. Instead of naming a directory My Pictures, you’d name it My_Pictures, my.pictures, or mypix. Why? Shells break a command line at spaces, and most shells also word-split the results of variable and command substitution. So a filename containing spaces can be split into pieces.

Shells also assign a meaning to most non-alphabetic characters. A file named odds&ends or cool!! can cause trouble if you aren’t paying attention.

Applications that don’t get filenames from a shell don’t have these restrictions. So, Windows and Mac OS users — and users of graphical applications on Linux, too — can cause filesystem havoc for shell users.

What’s a guru to do? This month, let’s see what’s behind the problems and find some workarounds.

Parsing Problems

Each shell command-line — read from a shell prompt or a shell script — is parsed into words and other tokens, such as redirection symbols. By default, words are separated by whitespace: space, tab, and newline characters.

But whitespace characters are legal in filenames. If you’re trying to remove a file named old report.txt this way, watch out:

$ rm old report.txt
rm: cannot remove `old'.
rm: cannot remove `report.txt'.

The shell broke the command-line into three words and passed two arguments to the rm utility: old and report.txt.

Other non-alphabetic characters have their own special meanings to the shell and can cause even more trouble if you don’t watch out:

zsh% ls -s odds&ends
[1] 3736
zsh: command not found: ends
zsh% ls: odds: No such file or directory
[1]  + exit 1     ls -s odds

The & (“ampersand”) operator runs the words to its left as a background command. So, instead of listing the file odds&ends, the shell ran ls -s odds in the background and tried to run a command named ends in the foreground.

Control Parsing with Quotes

Surrounding an argument with a pair of quoting characters (' ' or " ") tells the shell not to interpret some or all non-alphabetic characters in that argument. Single quotes stop all or almost all interpretation; double quotes allow some interpretation. (This varies a little from shell to shell; see your shell’s man page.) You can also type the escape character backslash (\) before special characters that you want the shell to read literally. The shell would have done what we meant if we’d typed:

rm "old report.txt"
ls -s odds\&ends
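
One way to see what quoting changes is to count the arguments that a command actually receives. The sketch below does that with count_args, a hypothetical helper function that isn’t part of any standard toolkit; it simply prints its argument count.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args old report.txt)     # the shell splits at the space
quoted=$(count_args "old report.txt")     # double quotes keep one argument
escaped=$(count_args old\ report.txt)     # a backslash escapes the space
amp=$(count_args odds\&ends)              # \& is a literal ampersand, not an operator

echo "unquoted=$unquoted quoted=$quoted escaped=$escaped amp=$amp"
```

Only the unquoted form reaches the helper as two arguments; the other three arrive as one.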

Let the Shell Do It

If you don’t type a special character on the command-line, the shell generally won’t interpret it. (The next sections list some exceptions.) So, if you enter a filename by using wildcards or filename completion, any special characters in the name won’t be a problem.

When you use filename completion, most shells automatically escape special characters in the completed name. For instance, if you type ls -s od and hit TAB to complete the name, the command-line “automagically” becomes:

zsh% ls -s od[TAB]ds\&ends

Shells parse command lines at spaces before they expand wildcards, so the names that a wildcard expands into aren’t split again. Special characters in expanded filenames are passed to a program intact:

$ rm -i ol*
rm: remove file 'old report.txt'?
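
If you’d like to check this without risking a real file, here’s a small sketch that works in a scratch directory made by mktemp. The variable names (dir, glob_count, glob_first) are ours, chosen only for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
dir=$(mktemp -d)
: > "$dir/old report.txt"

# The wildcard matches one file, so the shell passes one argument,
# even though the name contains a space.
set -- "$dir"/ol*
glob_count=$#
glob_first=$1

echo "$glob_count argument(s): $glob_first"
rm -rf "$dir"
```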

Variable Substitution: Ouch!

After most shells expand a shell variable into its value, they parse the value into separate arguments at whitespace. (Bourne-type shells actually use the characters in the IFS variable.) So you’re asking for trouble with code like this, which stores a series of pathnames from find into the shell variable dirname:

find /home -type d -print |
while read dirname
do
   cd $dirname
   .

In all shells except zsh, you should always use double quotes when a variable value might contain whitespace:

cd "$dirname"

Because double quotes don’t stop the shell from interpreting enclosed $ characters, variable substitution still happens. When double-quoted, though, expanded values aren’t broken into words. (By default, the Z shell doesn’t break expanded variables into words. Still, using double quotes in zsh isn’t a bad idea.)
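
Here’s a minimal sketch of the difference, followed by one possible quoted version of the find loop. It assumes a Bourne-type shell; count_args is a hypothetical helper, and the scratch directory and its two subdirectories are made up for the demonstration. Note that the loop still can’t cope with a newline inside a directory name, because find -print separates names with newlines.

```shell
count_args() { echo "$#"; }       # hypothetical helper: prints its argument count

dirname='My Pictures'
split=$(count_args $dirname)      # unquoted: split into My and Pictures
whole=$(count_args "$dirname")    # quoted: one argument

# A quoted version of the find loop. IFS= and -r stop read from
# trimming whitespace or interpreting backslashes in each pathname.
top=$(mktemp -d)
mkdir "$top/My Pictures" "$top/mypix"
visited=$(
  find "$top"/* -type d -print |
  while IFS= read -r dirname
  do
    ( cd "$dirname" && pwd )      # subshell: each cd starts from the same place
  done | wc -l | tr -d ' '
)
rm -rf "$top"

echo "split=$split whole=$whole visited=$visited"
```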

Command Substitution: More Problems

Command substitution (the operators ` command ` and $(command)) has the same kind of problem: the output of the command is parsed into words at whitespace. So a command like the one below can fail when filenames contain spaces:

$ ls
Monday log    Sunday log
Friday log    Thursday log
$ lpr $(grep -l ERROR *)
lpr: Friday: No such file or directory
lpr: log: No such file or directory
lpr: Sunday: No such file or directory
lpr: log: No such file or directory

We’re trying to print all files that contain the word ERROR. The * wildcard expands into a list of filenames for grep to search. That’s no problem, because, as we’ve seen, wildcard expansion handles spaces properly. But when we give lpr the output of grep -l (which is the names of files containing the string ERROR), if grep -l outputs any filenames containing whitespace, the shell breaks those into words and passes lpr some broken arguments.

Putting double quotes around the command substitution characters doesn’t help. This tells the shell not to break the output at whitespace, so multiple filenames will be packed into one argument containing spaces and a newline, just as grep output them:

$ lpr "$(grep -l ERROR *)"
lpr: Friday log
Sunday log: No such file or directory

The workarounds for this general problem are tricky or ugly. What’s needed is a way to split a list of filenames into words accurately, even if the names contain whitespace. You could fiddle with IFS, the shell variable (in Bourne-like shells only) that lists characters used for word splitting, but that list may need to be set case-by-case. And the bottom-line problem is that all of the shell’s word-splitting characters are legal in a filename.
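
To make the IFS idea concrete, here’s a rough sketch that splits grep output at newlines only. It assumes a Bourne-type shell and that no filename contains a newline, which is exactly the kind of case-by-case assumption that makes this approach fragile. The log files are invented for the demonstration.

```shell
dir=$(mktemp -d)
printf 'ERROR: disk full\n' > "$dir/Friday log"
printf 'all quiet\n'        > "$dir/Monday log"
printf 'ERROR: no paper\n'  > "$dir/Sunday log"

list=$(grep -l ERROR "$dir"/*)   # one pathname per line

old_ifs=$IFS
IFS='
'                                # split at newlines only
set -f                           # no wildcard expansion of the split words
set -- $list
set +f
IFS=$old_ifs

match_count=$#
first_match=$1
echo "$match_count files; first is $first_match"
rm -rf "$dir"
```

The positional parameters now hold one pathname each, spaces included, and could be passed along as "$@".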

Splitting at NUL characters

Two ASCII characters can’t be used in a filename: slash (/) and NUL (octal zero). Pathnames contain slashes, so you shouldn’t split there. The answer is to use NUL.

The use of NUL characters (all-zero bytes) as separators to avoid whitespace problems started years ago. One of the first places was the find action -print0, which prints pathnames separated by NULs (instead of the newlines you get with -print). This is used with the xargs option -0, which expects to read NUL-separated pathnames. Now several GNU utilities emit or read NUL-separated lists of arguments.

For instance, the grep option -Z or --null outputs NUL instead of newline after each filename. So you can do:

grep -lZ ERROR * | xargs -0 lpr
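
You can try that pipeline safely with a stand-in for lpr. In this sketch, which assumes a grep with -Z and an xargs with -0 (the GNU versions have both), basename plays the part of the printer command and the log files are invented.

```shell
dir=$(mktemp -d)
printf 'ERROR: disk full\n' > "$dir/Friday log"
printf 'all quiet\n'        > "$dir/Monday log"
printf 'ERROR: no paper\n'  > "$dir/Sunday log"

# basename stands in for lpr here; -n 1 runs it once per pathname.
printed=$(grep -lZ ERROR "$dir"/* | xargs -0 -n 1 basename)

echo "$printed"
rm -rf "$dir"
```

Each name reaches the stand-in command whole, so the two matching files come out as Friday log and Sunday log.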

You can also use shell wildcards and the printf utility to output a NUL-delimited string. The command-line printf, as well as the version built into shells, re-uses a format string once for each argument that doesn’t have its own specific format string. (Note that some versions of printf may not work this way.) This can be handy in cases where you’re using a pipeline that feeds NUL-separated pathnames into a command like sort -z.

For example, let’s pass both a list of pathnames from find and some other pathnames from a bash array to sort. Use subshell operators to collect the standard outputs of find, followed by the printf with all of the members of $array. Sort them together, use pr to paginate the files, and lpr to print them all:

$ (find ..... -print0;
> printf "%s\0" "${array[@]}" ) |
> sort -z | xargs -0 pr | lpr
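
Here’s a self-contained sketch of the same idea that you can run without a printer. It makes some assumptions: the positional parameters stand in for the bash array, grouping braces replace the subshell, sort -z is the GNU version, and tr converts the NULs to newlines at the end only so the result can be stored and displayed. The pathnames are invented.

```shell
dir=$(mktemp -d)
mkdir "$dir/b dir"
: > "$dir/a file"
: > "$dir/b dir/z file"

# Positional parameters stand in for the bash array; in bash you could
# write "${array[@]}" (quoted) in place of "$@".
set -- "$dir/m file" "$dir/c file"

# Braces group the two commands so both feed the same pipe.
sorted=$(
  { find "$dir" -type f -print0; printf '%s\0' "$@"; } |
  LC_ALL=C sort -z | tr '\0' '\n'
)
sorted_count=$(printf '%s\n' "$sorted" | wc -l | tr -d ' ')

echo "$sorted"
rm -rf "$dir"
```

The printf format is re-used once per argument, so every pathname gets its own trailing NUL.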

zsh Variable Expansion Flags

The Z shell can parse strings into arguments at any arbitrary character, including NUL, by using its variable expansion flags. These can be combined with command substitution to do this useful job directly — without the extra pipe and xargs process that was used above. Let’s have a look.

In all shells, variable expansion like $var can also be written with curly braces, like ${var}. Just after an opening curly brace, zsh allows one or more variable expansion flags in parentheses. The zshexpn manpage lists the flags.

For example, if you want to convert a directory with uppercase and mixed-case filenames so that all names are lowercase, you could use the L or “lower case” expansion flag as follows:

for file in *
do
   mv -i ${file} ${(L)file}
done

But here you want to do something else: split a string at NULs. The variable expansion flag s lets you specify which character should be used to split a string into arguments, and the p flag lets you use an escape sequence instead of a literal character. The flags to use are (ps:\0:). (\0 is the escape sequence for a NUL character.)

You could type:

zsh% grepout="$(grep -lZ ERROR *)"
zsh% lpr ${(ps:\0:)grepout}

But, in place of the variable name, you can use command substitution. Remember that the result of command substitution is word-split by default. You don’t want the result to be split at whitespace, so you’ll need double quotes to keep the grep output in a single string before splitting it at NULs. The result will be a string of filenames separated by NULs; any non-NUL characters will be preserved as part of their filenames.

Here goes:

zsh% lpr ${(ps:\0:)"$(grep -lZ ERROR *)"}

If you’re new to zsh, that may look scary, but it’s actually just a more-complex example of expansion flags. The Z shell’s expressive syntax lets you do a lot with a little typing.

A Workaround with Symbolic Links

If you’re stuck with difficult-to-use filenames, you could make a directory full of sensibly-named symbolic links that point to the difficult filenames in some other directory. The shell script safedir does this.

At more than 100 lines — mostly comments and error-checking — the script is too long to print in this short column. (You might want to download it and follow along as you read this.) Let’s take a look at the techniques that safedir uses.

Listing One: The character translation list

xlate='=\!#$%^&*()|\\'"'"';/<>?~`[]{} 	
'
fix=_

Listing One shows the definition of the shell variable xlate. It holds a string of characters that you don’t want to have in filenames. The tr utility will replace each of those characters with an underscore (_), taken from the fix variable. The string in xlate contains space, tab, and newline characters. (The newline is embedded in the string because the closing quote character is on the next line.)

The backslash (\) is repeated because tr uses the escape sequence \\ to represent a literal backslash.

Notice the quoting. Because a string surrounded by single quotes can’t itself contain a single quote, the code switches quote characters mid-string. The first characters in the string are surrounded by single quotes. Next comes a single quote surrounded by double quotes. The last characters in the string again have single quotes around them. Because there’s no space outside quote marks, all of this is stored into the xlate shell variable — without the three pairs of quotes surrounding each section of the string, of course.
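
The same quote-switching trick works for any string that needs a single quote inside it. A tiny sketch, with a made-up string and variable name:

```shell
# Three adjacent pieces with no space between them:
#   'don'   "'"   't panic'
s='don'"'"'t panic'
echo "$s"
```

The shell joins the three pieces into one word, so s holds the text don't panic with no quote marks other than the apostrophe.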

Listing Two shows the “guts” of the script: a for loop that steps through all files in the source directory, making a symlink from the destination directory with a “safe” name.

This script runs under bash (it starts with #!/bin/bash to require that shell explicitly), so you can use the -E option of bash’s built-in echo to be sure that echo doesn’t try to interpret part of the original filename from $f as an escape sequence.

In most versions of tr — including the GNU version that you should have on your Linux system — the replacement character string can contain a character followed by an asterisk (*), surrounded by square brackets. Here that character is an underscore from the shell variable $fix. The asterisk on replacement pattern [_*] tells tr to replace any character from the first string (listed in $xlate) with an underscore.
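
Here’s a short sketch of that translation on its own. The xlate list is a shortened, hypothetical one (just ampersand, parentheses, and space), and printf '%s' is used in place of echo -E -n as a portable way to emit the name unchanged; it assumes a tr that accepts the [c*] form, as GNU tr does.

```shell
xlate='&() '     # a shortened, hypothetical list: ampersand, parentheses, space
fix=_

# Every character of the name that appears in $xlate becomes an underscore.
fnew=$(printf '%s' 'odds&ends (old).txt' | tr "$xlate" "[${fix}*]")
echo "$fnew"
```

The dots and letters pass through untouched; only the four listed characters are replaced.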

One thing that surprises even some experienced Bourne shell programmers is the line if ln -s.. This runs the ln -s command and tests its exit status. If the exit status is zero (“success”), the then block of the if statement is executed; otherwise the else block. (Many people believe that if must always be used with the test or [ command, but that’s not true.)
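
A quick way to convince yourself is to run ln -s twice with the same link name, as in the sketch below. The scratch directory and the variable names are ours; the link target doesn’t matter for this test.

```shell
dir=$(mktemp -d)

# First attempt: the name is free, so ln succeeds (exit status 0).
if ln -s /etc/hosts "$dir/link" 2>/dev/null
then first=created
else first=failed
fi

# Second attempt: the name now exists, so ln fails (nonzero exit status).
if ln -s /etc/hosts "$dir/link" 2>/dev/null
then second=created
else second=failed
fi

echo "first=$first second=$second"
rm -rf "$dir"
```

No test or [ command appears anywhere; if simply branches on the exit status of ln.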

The standard output of the for loop is redirected to a file named .safedir_translation_table (from the variable $ttable). This file is a tab-separated list of each newly-generated “safe” name followed by the original filename.

Listing Two: Making a safe filename

# Loop through current directory and make symlinks.
for f in *
do
  fnew=$(echo -E -n "$f" | tr "$xlate" "[${fix}*]")
  .
    if ln -s "$curdir/$f" "$destdir/$fnew"
    then
      # Write to stdout, redirected at end of loop to $ttable file:
      # This string has an embedded TAB (we can't use \t because
      # "echo -e" could mistakenly interpret an escape sequence in $f):
      echo -E "$fnew	$f"
    else
      echo "$myname: ABORTING: can't create symlink $destdir/$fnew?" 1>&2
      giveup
    fi
  fi
done > "$destdir/$ttable"

The script makes the symbolic links in a subdirectory of /tmp. The directory’s name ends with your username and the shell’s process ID number, which makes the name likely to be unique. The script accepts one command-line argument: the pathname of the directory with the “unsafe” names. After making the symlinks, the script writes the temporary directory name to standard output. So you can run the script with command substitution — to create the temporary directory and cd (or pushd) to it. Here we’re making a copy of the current directory (whose pathname is always available as .):

tcsh> pushd `safedir .`

The script is new and may be changed somewhat. If you’re reading this article in the future, the file you find online may be a bit different from what’s described here.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for over 20 years. He’s happy to hear from readers; see https://www.jpeek.com/contact.html.