Filenames by Design, Part Three

Still more from our series on how to take full advantage of your filesystem with tips and tricks for the newbie and old pro alike.

This column is the third in a series about designing trees of directories and files that help you find data. Because Linux filesystem entries can have almost any character in their names (you can’t use slash or NUL), you can create systems of names that include metadata about the file contents. That makes it easier to find out what’s in a file without needing to read a separate database about the files — or the file itself.

Many of the techniques work on any kind of filesystem tree — not only filesystems with a particular organization. Although we’ll see techniques using shells and utilities, you can also open the files from, say, the menu of a graphical application. Planning ahead at the time you organize your files can make them easier to find and use.

find, your friend

Studying and experimenting with the extremely useful find(1) utility will pay you back many times. (It’s also good to know about the many GNU updates to find.) It’s handy from the command line when you’re trying to locate a particular file. But it’s also great for passing a series of pathnames to utilities, to shell loops, and to scripts in other languages.

Here’s an example: using lpr to print all files with names ending in .txt in each of the subdirectories (or sub-sub-directories, and so on) whose names start with Denver_07 or Denver_08. (You can enter loops directly at a shell prompt, as we do here. In bash, the secondary prompt > means that the shell is waiting for you to complete a statement.)


$ for dir in $(find . -type d -name 'Denver_0[78]*' -print)
> do
>   cd "$dir" || break
>   lpr *.txt
>   cd -
> done

The || break ends the loop if any cd "$dir" command fails. Many shells understand cd - as “go to the previous directory”. That’s needed here to return to the starting directory, because find outputs relative pathnames (like ./subdir/Denver_08_2006) that start at the directory where the search began.
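One caution: the $(find ...) form splits its output on whitespace, so a directory name containing a space would break the loop. A more robust sketch (assuming GNU find and bash; the Denver directory and file names here are made up for the demo) pipes NUL-terminated names into a while loop, and runs each cd in a subshell so there’s no need for cd -:

```shell
# Demo setup in a scratch directory:
tmp=$(mktemp -d)
mkdir -p "$tmp/Denver_07 trip" "$tmp/other"
touch "$tmp/Denver_07 trip/notes.txt"

# -print0 emits NUL-terminated names; read -d '' (bash) splits on NUL,
# so names with spaces survive intact.
find "$tmp" -type d -name 'Denver_0[78]*' -print0 |
while IFS= read -r -d '' dir
do
    ( cd "$dir" && ls *.txt )    # replace ls with lpr to print the files
done
```

The parentheses run each iteration’s cd in a subshell, so the loop never leaves the starting directory.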

If each filename includes metadata about its file, find can use that filename to choose particular files. For instance, the photo filenames at the end of the first article in this series included the dimensions in pixels of the photo. (The file 0012345_01_5248x4100.tif holds a 5248x4100-pixel photo.) To list all photos at least 4000x4000 pixels in size, you could type:

  find . -name '*_[4-9][0-9][0-9][0-9]x[4-9][0-9][0-9][0-9]*' -ls

Tip: Copy the bracket expression [0-9] with your mouse or your editor, then paste it as many times as needed.

Having a well-thought-out syntax for each filename helps you find them reliably. Luckily, it can be easy to rename files within an organized system like this. For instance, see the section “Renaming existing files” in the previous article in this series.

If you need more “finding” power, try the GNU find option -regex. It lets you use regular expressions instead of the simpler shell wildcard patterns shown above.
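For instance, the pixel-size search above could be written as an extended regular expression, where {3} repeats the digit class so there’s nothing to paste over and over. (The -regextype option is a GNU find feature; note that -regex matches against the whole pathname, not just the filename. The filenames below are made up for the demo.)

```shell
# Sample files in a scratch directory:
tmp=$(mktemp -d)
touch "$tmp/0012345_01_5248x4100.tif" "$tmp/0012346_01_640x480.tif"

# {3} repeats [0-9] three times; the pattern must match the entire path.
find "$tmp" -regextype posix-extended \
     -regex '.*_[4-9][0-9]{3}x[4-9][0-9]{3}.*' -ls
```

Only the 5248x4100 file is listed; 640x480 fails the “at least 4000 pixels” test.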

Recursive Wildcards: zsh

The amazing Z Shell has recursive wildcard operators ** and *** (the latter also follows symbolic links) that do a lot of what find does. And the zsh glob qualifiers restrict how wildcards match. Here are three examples.

The for loop above could be rewritten as follows. The wildcard pattern **/Denver_0[78]* matches all pathnames in the current directory and below that end with a file or directory whose name starts with Denver_07 or Denver_08:


zsh% for dir in **/Denver_0[78]*
for> do
for>   cd "$dir" || break
...

(Z shell secondary prompts name the incomplete command(s) they’re waiting for — in this case, the for loop.)

If any non-directories in the tree might have a name matching Denver_0[78]*, you could add the glob qualifier (/) to match only directories:


zsh% for dir in **/Denver_0[78]*(/)
...

These recursive wildcards are handy when you know the exact name of a file but you don’t know what directory it’s in. You can even use them as the destination argument to a command. Let’s say you have a file named report123.doc in some directory. You’d like to overwrite it with a copy of the file report123_new.doc from the current directory, while keeping its current name report123.doc. Here’s how — using the cp option -v to show the source and destination pathnames:


zsh% cp -v report123_new.doc **/report123.doc
`report123_new.doc' -> `reports/a/1/report123.doc'

Searching by parsing

When find and shell wildcards aren’t enough, try splitting a filename into its parts. For example, you want to find all horizontal photos in the current directory. The directory has mixed contents, but all photos have filenames ending in .jpg or .tif. Use ls to get a list of filenames, sed to parse the width and height from each name, and the shell’s built-in arithmetic comparison to find the files with a larger horizontal dimension than vertical. (All filenames have a non-numeric character after the vertical dimension.)

ls *.jpg *.tif |
sed 's/\(.*_\)\([1-9][0-9]*\)x\([1-9][0-9]*\)\(.*\)/\2 \3 \1\2x\3\4/' |
while read -r width height filename
do
  if [[ $width -gt $height ]]
  then echo "$filename"
  fi
done

The sed s command reads each filename, then writes the width, height, and the filename on its standard output. The shell’s read command reads the first word (up to the first space) into the shell variable $width, the second word into $height, and the rest of the line into $filename. A sample line of sed output might be:

  5248 4100 0012345_01_5248x4100.tif

Once you’re familiar with these sed s/// commands, they’re actually quick and easy to type. (Your shell’s command-line editing can help.)

Of course, you could do something other than echoing the matching filenames. And there are other ways to parse filenames — including using other scripting languages.
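As one hypothetical alternative, the same width/height test can be done without sed at all, using the shell’s parameter expansion to strip pieces of the name (this sketch assumes bash and the naming convention shown above):

```shell
name=0012345_01_5248x4100.tif

size=${name##*_}      # strip through the last "_"  -> "5248x4100.tif"
size=${size%%.*}      # strip the extension         -> "5248x4100"
width=${size%x*}      # keep what's before the "x"  -> "5248"
height=${size#*x}     # keep what's after the "x"   -> "4100"

if [ "$width" -gt "$height" ]
then echo "$name is horizontal"
fi
```

Parameter expansion runs inside the shell, with no extra processes, so it can be noticeably faster in a loop over many names.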

From tree to tree

Parallel directory trees with the same structure can be useful. For instance, in the first article of this series, the structure in Figure One has parallel trees rooted at the directories archive, browsing and current.

If you need a duplicate tree, you can create the tree structure by copying the directories only. Here’s one way, using find to find the directories and write their relative pathnames to xargs, which runs mkdir as many times as needed. The old Unix trick of piping into a subshell (the parenthesis operators) means that, while find is outputting pathnames from underneath olddir, the xargs and mkdir programs are running in the directory ../newdir, getting pathnames through the pipe:

mkdir newdir
cd olddir
find * -type d -print | (cd ../newdir && xargs mkdir -pv)

We’re using * with find (instead of the more usual . — which is the current directory) to skip any subdirectories of olddir whose name starts with a dot. (By default, wildcards don’t match those “hidden” directory entries.) The * also gives find “clean” directory names that don’t start with ./. (There’s nothing wrong with a command like mkdir -p ./a/b, but mkdir -p a/b is just “neater”.) By the way, && runs xargs only if the cd succeeded, which prevents copying the directory tree on top of itself if the destination directory ../newdir doesn’t exist.
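The whole operation can be tried safely in a scratch directory (the tree names here are made up for the demo):

```shell
# Build a sample tree, then duplicate its directory structure:
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p olddir/a/1 olddir/b olddir/.hidden

mkdir newdir
cd olddir
find * -type d -print | (cd ../newdir && xargs mkdir -pv)

ls ../newdir    # a and b were copied; .hidden was skipped
```

The -v option makes mkdir announce each directory it creates, which is a handy sanity check the first time you run a pipeline like this.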

Filling parallel trees with related files is also easy to do. For instance, to read a list of files in a subdirectory of the current tree, then do an operation on the identical filenames in the browsing tree, a loop like this can do the job:


  for f in `ls current/01/200`
  do something browsing/01/200/$f
  done

The ls command outputs a list of filenames: 01200_03 01201_01 and so on. Then the something command receives pathnames one by one, like browsing/01/200/01200_03 and browsing/01/200/01201_01. A different loop structure could do something else.
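A glob-based sketch does the same job without parsing ls output (which breaks on filenames containing spaces); ${f##*/} strips the leading directory part of each pathname. Here echo stands in for the hypothetical something command, and the tree is built just for the demo:

```shell
# Demo trees in a scratch directory:
tmp=$(mktemp -d)
mkdir -p "$tmp/current/01/200" "$tmp/browsing/01/200"
touch "$tmp/current/01/200/01200_03" "$tmp/current/01/200/01201_01"

cd "$tmp"
for f in current/01/200/*
do echo browsing/01/200/"${f##*/}"   # echo stands in for "something"
done
```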

Joining forces with a database

Depending on how much metadata you have about a file, and how long you want the filename to get, you may not want to keep all metadata in the filename. That’s when a database — for instance, a flat file or a relational database — can make sense.

If the data you need to access quickly is stored within the files themselves — for example, the EXIF and IPTC data that’s kept in many digital photo files — consider building a quick-access index file periodically. You could run a cron job late at night, when the system isn’t busy, to read the photo files and write the data you’ll need into index file(s). Linux data files commonly use TAB-separated fields and newline-delimited records; also, utilities for sorting and parsing data files often default to those separator characters. (It’s easy to choose different characters, though.) For example, the first field in an index file might be the file’s directory pathname, the second the filename, the third could contain some sorting token such as the date (from EXIF data) that the photo was created.
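Reading EXIF data takes a dedicated tool, but the shape of such an index can be sketched with GNU find alone: -printf writes TAB-separated directory, filename, and date fields, one record per line. (In a real cron job, the photo’s EXIF creation date would replace the file modification date used here; the file names are made up for the demo.)

```shell
# Sample photo tree in a scratch directory:
tmp=$(mktemp -d)
mkdir -p "$tmp/photos/a"
touch "$tmp/photos/a/0012345_01_5248x4100.tif"

# One TAB-separated record per file: directory, name, sort key (mtime date).
find "$tmp/photos" -type f \
    -printf '%h\t%f\t%TY-%Tm-%Td\n' > "$tmp/index.tsv"

cat "$tmp/index.tsv"
```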

If you’re indexing a huge collection of files, building the index can take hours. Consider using multiple index files and updating only the files that need changing. Using find with tests like -mtime, -ctime, -newer, and others can help you find recently changed files that need indexing. If you’re sorting data, be sure that your system’s temporary file directory (/tmp, or the directory named in the environment variable TMPDIR) has enough room. If it might not, here are two ways to set another directory while sort runs:

TMPDIR=/some/directory sort .
sort --temporary-directory=/some/directory .

Handy utilities for accessing your index files include grep(1) for searching records, cut(1) and join(1) for handling fields, and sort(1) and awk(1) for ordering and reporting.
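For instance, given a TAB-separated index like the one sketched earlier (the records below are made up for the demo), cut pulls out a single field and sort orders the records by the date field:

```shell
# A small made-up index: directory, filename, date.
tmp=$(mktemp -d)
printf '%s\t%s\t%s\n' \
    photos/a 0012345_01.tif 2006-08-14 \
    photos/b 0012399_02.tif 2006-07-02 > "$tmp/index.tsv"

# Filenames only (field 2; cut splits on TAB by default):
cut -f2 "$tmp/index.tsv"

# Oldest photo first (sort on field 3, using a literal TAB separator):
sort -t "$(printf '\t')" -k3,3 "$tmp/index.tsv"
```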

Finally, you can export and import data from spreadsheets (like OpenOffice.org Calc) in formats that are easy to use with files and utilities. Look for a format that uses TAB-separated fields; it’s an easy choice as long as none of your data includes TAB characters.

Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.
