Using text and utilities to organize and access files.
Linux runs on text. Configuration files are often human-readable text. Many other files contain text, too, and text often flows through Standard I/O connections. Linux has powerful utilities to handle text; you can also use a scripting language.
The names of files and their locations (pathnames) are also usually text. So, the techniques you use to process text can also be used to process files.
(Of course, if looking through file listings and clicking on some of them is the best way to find what you want, Linux has GUI browsers like Nautilus and Konqueror.)
This article covers ways to make lists of files — on-the-fly or in another file — then narrow the list to just what you’re looking for. We’ll use lots of shell loops with redirected I/O; if you need an introduction, see the section “Let a Loop Do The Work” in Great Command-line Combinations.
The third article in the Filenames by Design series shows ways to find files by name when those files are part of a thoughtfully-designed system. If you’re like me, though, you can only wish that all of your files were in a system that makes everything easy to find. (Some projects are carefully planned. Others are 3 a.m. hacks that you can’t finish neatly before the next crisis hits.)
Attributes like the last-modification timestamp or the size can help you find a file that’s hidden like a needle in a haystack. See the sidebar Some file attributes for suggestions. Of course, attributes aren’t always enough.
One of my favorite quick ways to save files from a project is to make a tar(1) archive in gzip(1) format with a name like project-name_1996-02-15.tar.gz and transfer it into a directory named tarballs on my main system. That’s great if I remember the name of the project or when I worked on it. More likely, though, I’ve forgotten what year it was or what conference I was about to attend when I wrote that file with the example I’m looking for. It’s time for power tools.
(By the way, this is a specific example of a general technique. These ideas also work for single files that aren’t in an archive.)
Start by thinking where the data might be — and, once you find some likely spots, what tools could extract it. Here we’re looking for gzipped files. Uncompressing each file onto the disk and searching through it can take a lot of disk space. But the GNU zcat(1) utility (also known as gunzip -c) reads a compressed file in various formats, uncompresses its contents on-the-fly, and writes them to standard output. That lets you avoid temporary files by writing data into a pipe.
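As a quick check of that idea, here’s a minimal sketch (the sample file path /tmp/zcat-demo.txt is made up for illustration): compress a file, then search it through a pipe, so the uncompressed data never touches the disk.

```shell
# Make a small sample file and compress it in place.
printf 'alpha\nneedle in a haystack\nomega\n' > /tmp/zcat-demo.txt
gzip -f /tmp/zcat-demo.txt        # leaves /tmp/zcat-demo.txt.gz

# zcat uncompresses to stdout; grep reads the pipe.
# No temporary uncompressed file is ever created.
zcat /tmp/zcat-demo.txt.gz | grep -c needle    # prints 1
```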
We’ll be searching tar archives. What’s in a tarball? It’s a series of entries: each holds a file’s metadata, followed by that file’s content. We want to find string(s) somewhere in the content of one of those files. A quick-and-dirty technique is to search the entire tarball for the string you’re looking for, filtering the search results to keep non-text characters from messing up the screen. (You may not need tar unless you’re extracting a file from the archive.) Let’s start with that:
$ cd tarballs
$ for file in *1996* *usenix*
> do
>   zcat "$file" |
>     grep -i -H --label="$file" 'pattern'
> done | cat -v
Binary file ora_1996-04-15.tar.gz matches
Binary file usenix_1999.tar.gz matches
The wildcard patterns *1996* and *usenix* match all filenames in the directory that include 1996 or usenix. If that list might contain duplicates, you could either use a more specific wildcard pattern or start the loop this way:
for file in $(/bin/ls -d1 *1996* *usenix* | uniq)
/bin/ls -d1 (that’s a digit 1) lists the matching filenames, one per line, in sorted order; uniq removes the adjacent duplicate names. (Using /bin/ls bypasses any alias you might have for ls. The -d option tells ls to list directory names instead of their contents.)
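To see the deduplication work, here’s a self-contained sketch with made-up filenames. The file usenix_1996.tar.gz matches both wildcard patterns, so the shell expands it twice:

```shell
# Scratch directory with three empty demo files.
dir=$(mktemp -d)
cd "$dir"
touch ora_1996.tar.gz usenix_1996.tar.gz usenix_1999.tar.gz

# The globs expand usenix_1996.tar.gz twice (once per pattern).
# ls sorts the names, so the duplicates are adjacent,
# and uniq removes one of them; the loop runs three times.
for file in $(/bin/ls -d1 *1996* *usenix* | uniq)
do
    echo "$file"
done
```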
The grep options search for pattern case-insensitively (-i). --label="$file" makes grep output the filename, expanded by the shell from $file. (The --label option seems to also require -H, on grep version 2.5.1, at least.) Finally, the loop output is piped through cat -v. This makes sure that your screen won’t turn into mush if the search matches a line containing non-textual data — such as a filename, surrounded by control characters, embedded in a file’s metadata.
The cat -v trick is a good one. It actually wasn’t needed here, though, because grep decided that the tarballs were “binary” files — that is, the first few bytes were non-textual — so it output “Binary file file matches”. Adding the option --binary-files=text tells grep to show the matching lines anyway. We’ll try that next.
What if the archives you’re searching are spread around the filesystem instead of in a single directory? You could feed the loop with a recursive file search:
$ find . -type f -name '*.tar.gz' -print |
> while read -r file
> do
>   zcat "$file" |
>     grep -i -H --label="$file" --binary-files=text 'pattern'
> done | cat -v
./tarballs/ora_1996-04-15.tar.gz:<H1>Patterning Yourself
./tarballs/usenix_1999.tar.gz:and pattern too
There, find outputs file pathnames, one per line. Those pathnames are piped to a while loop, where read -r makes each pathname available in $file. The loop iterates until read runs out of pathnames.
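One caveat: read -r reads one line at a time, so it can’t handle the (rare) pathname that contains a newline, and it strips leading and trailing spaces. If your shell is bash and your find is GNU find, a NUL-separated variation is safer. Here’s a self-contained sketch; the demo tarball name and the search string are made up for illustration:

```shell
# Demo setup: one tarball whose member file contains the string.
workdir=$(mktemp -d)
cd "$workdir"
mkdir tarballs
printf 'a pattern in a file\n' > notes.txt
tar -czf tarballs/demo_1996.tar.gz notes.txt

# -print0 and read -d '' use NUL separators, so any legal
# pathname -- even one with spaces or newlines -- gets through.
find . -type f -name '*.tar.gz' -print0 |
while IFS= read -r -d '' file
do
    zcat "$file" |
        grep -i -H --label="$file" --binary-files=text 'pattern'
done | cat -v
```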
That search might be overly broad, leaving you looking through lots of results. You could try redirecting the loop output to a file, then paging through the file with less(1):
...
> done | cat -v > /tmp/filesearch-output
$ less /tmp/filesearch-output
...
Or simply pipe the output of the loop to less:
...
> done | cat -v | less
Within less, you can type /pattern to search for the pattern that grep found. Each occurrence of the pattern will be highlighted so it’s easier to spot.
The previous example used programs to narrow the search. There are times when it’s better to narrow the list of files by hand, using a tool like a text editor to choose the filenames.
(As mentioned earlier, this technique isn’t just useful for tarballs. For example, if you’re searching for images, a tool like ImageMagick’s identify -verbose could do the trick. It sends image metadata and comments down a pipe or into a file where you can search for the image you want.)
The file filesearch-output from the previous section contains a colon-delimited list of filenames and search results:

./tarballs/ora_1996-04-15.tar.gz:<H1>Patterning Yourself
./tarballs/usenix_1999.tar.gz:and pattern too

Open that file in a text editor. Delete the lines for any files that don’t look promising, then delete everything from the first colon to the end of each remaining line. (You can delete the leading ./, too. It’s redundant.) You’re left with:

tarballs/ora_1996-04-15.tar.gz
tarballs/usenix_1999.tar.gz
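You don’t have to do all of that editing by hand. Assuming no pathname contains a colon, this sketch uses cut, sed, and sort to do the mechanical part (you can still edit the surviving lines afterward):

```shell
# Recreate the sample search output from above.
printf '%s\n' \
    './tarballs/ora_1996-04-15.tar.gz:<H1>Patterning Yourself' \
    './tarballs/usenix_1999.tar.gz:and pattern too' \
    > /tmp/filesearch-output

# Keep only the text before the first colon, strip the
# leading "./", and drop duplicate pathnames (one tarball
# can match on many lines).
cut -d: -f1 /tmp/filesearch-output | sed 's|^\./||' | sort -u
```

That leaves the two pathnames, one per line, with no leading ./.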
Now you have a file containing just pathnames. Use it to “drill down” to the information you want. For instance, get a list of the contents of each tarball and page through the listings with less:
$ while read -r file
> do
>   echo "====== $file"
>   tar -tzvf "$file"
> done < /tmp/filesearch-output | less
That loop reads filenames one-by-one from the file /tmp/filesearch-output (which you edited to contain just the likely filenames).
It outputs a title line that starts with six equal signs, followed by tar’s verbose listing of the tarball contents. You’ll see something like this:
====== tarballs/ora_1996-04-15.tar.gz
-rw-r--r-- jpeek/users  5250 2003-06-05 10:28 PID_list
-rw------- jpeek/users  4106 2004-09-29 11:16 UPT_Russian.jpg
...
====== tarballs/usenix_1999.tar.gz
drwx------ jp123/staff     0 2007-02-17 13:01 difftest/
drwx------ jp123/staff     0 2007-02-17 12:48 difftest/2/
-rw------- jp123/staff     6 2007-02-17 12:41 difftest/2/file1
...
If there’s still too much text, consider how you can filter the tar output. For instance, to skip the directories in each tarball listing, grep -v can omit the lines that start with d:
$ while read -r file
> do
>   echo "====== $file"
>   tar -tzvf "$file" | grep -v "^d"
> done < /tmp/filesearch-output | less
(The loop runs two commands for each value of $file. First, echo writes a title line to stdout. Second, tar lists the archive contents; grep removes the lines for directories and writes the other lines to stdout. After done, a pipe collects the stdout from both commands within the loop, and less pages through it.)
What we’ve seen here is a general technique: using whatever tools, automated or manual, will drill down quickly to the results you want. Shell wildcards and utilities like find help you get lists of filenames. Other utilities look inside those files to extract the data and test it.
(By the way, if you aren’t familiar with the shell’s command-line editing, it’s worth learning. It will save you a lot of retyping.)
Jerry Peek is a freelance writer and instructor who has used Unix and Linux for more than 25 years. He's happy to hear from readers; see https://www.jpeek.com/contact.html.