Safe and powerful use of xargs with bash and find

Say you have a directory of files, and you want to filter each of these files through some transformation and save the result to a new filename that is a function of the original filename. How do you do it? I found myself in just such a situation at work the other day, and I combined one of the unsung Unix power tools, xargs, which shines for this kind of problem, with two other command-line power tools: bash and find. The xargs command can take any number of input strings and distribute those strings as arguments to run another command one or more times; bash and find will hopefully be more familiar to you.

Note

This article leans heavily on the GNU versions of these utilities; many of the special features that I discuss are not available in non-GNU versions. Clearly, everyone should just install GNU software.

My problem was that I had a directory of input XML files with a certain naming convention, and I wanted to apply an XSLT transformation to each of them to convert them to RDF files, placing them in a second directory with a slightly different naming convention (.rdf instead of .xml). Here is the command line I (eventually) concocted:

find xml-dir -mindepth 1 -printf '%f\0' | \
  xargs -0 -n 1 bash -c \
  '4xslt -o "rdf-dir/${0%.xml}.rdf" "xml-dir/$0" transformation.xslt'

Let's walk through this step by step. The find command emits the name of every entry in the XML directory, each terminated by a null character, and pipes them to xargs. Because of the -n 1 option, xargs constructs one command line per input string. The command we run is bash, because we want bash's filename-manipulation functionality (part of its parameter expansion) applied to each input string. The -c option gives bash a command string to execute. xargs appends the next input string to each invocation of bash; since it is the first argument after the command string, bash assigns it to the special parameter 0. We expand $0 only inside double-quoted strings, so each expansion reaches 4xslt as exactly one argument, even if the filename contains whitespace. The expansion ${0%.xml} strips the trailing .xml from the name so that we can append .rdf instead. (The 4xslt command is an XSLT processor that uses the 4Suite engine.) It may help to see each command that xargs generates; you can instruct xargs to print them with its -t option.
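To see the $0 assignment and the suffix stripping in isolation, here is a minimal sketch that needs neither 4xslt nor any real files. The two filenames are made up for illustration, and echo stands in for the XSLT processor, so the output only shows the arguments that would have been built.

```shell
# Feed two NUL-terminated names (one containing a space) to xargs,
# in the same form that find -printf '%f\0' would produce.
# "echo" is a stand-in for 4xslt; nothing is actually transformed.
out=$(printf '%s\0' 'a.xml' 'my file.xml' |
  xargs -0 -n 1 bash -c 'echo "rdf-dir/${0%.xml}.rdf <- xml-dir/$0"')

printf '%s\n' "$out"
# rdf-dir/a.rdf <- xml-dir/a.xml
# rdf-dir/my file.rdf <- xml-dir/my file.xml
```

A common alternative is to put a placeholder such as _ after the command string, so that the placeholder becomes $0 and the filename arrives as $1. Either form works; the article's version simply saves one argument.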

Questions and answers

There are a number of slightly different approaches that I could have taken that might seem shorter or clearer in some way, but which would be wrong. Let's take a look at some of these.

Why not use ls instead of find?

When using ls, the first stage of the pipeline would look like:

ls xml-dir

In fact, that was the first approach that I used, until I realized the problem. If any of the filenames in xml-dir contained spaces or newlines, xargs would pass each whitespace-separated component of such a name to the target command as a separate argument, instead of the whole filename. (By default, xargs also treats quotes and backslashes in its input specially.) For that reason, the xargs man page suggests separating argument strings with a null character, and provides the -0 option to process such input. The ls command has traditionally provided no way to separate output entries with a null character, but find does.
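The difference is easy to observe without touching the filesystem. In this small sketch, the filename is invented, and the external printf command is used as the target because it prints each argument it receives on its own line.

```shell
# One filename containing a space, delivered to xargs in two ways.
# printf '<%s>\n' repeats its format for every argument it is given,
# so each argument shows up between its own pair of angle brackets.
split=$(printf '%s\n' 'my file.xml' | xargs printf '<%s>\n')
whole=$(printf '%s\0' 'my file.xml' | xargs -0 printf '<%s>\n')

printf 'default mode:\n%s\n' "$split"   # two arguments: <my> and <file.xml>
printf 'with -0:\n%s\n' "$whole"        # one argument:  <my file.xml>
```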

Ok, so why not use the -print0 option to find?

The find and xargs commands were clearly designed to work well together, and the (GNU) xargs documentation even mentions that “[t]he GNU find -print0 option produces input suitable for this mode” (that is, when using the -0 option). In many cases, this is exactly what you want. In our case, however, we don't want the full filename, including its directory, but only its last component. Using -printf '%f\0' gives us just this last component, still terminated by a null character.
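A quick way to compare the two output formats is to run both against a scratch directory. This is a sketch that assumes GNU find; the directory and file names are illustrative, and tr converts the null terminators to newlines only so that the result is readable.

```shell
# Build a throwaway tree: <tmp>/xml-dir/a.xml
tmp=$(mktemp -d)
mkdir "$tmp/xml-dir"
touch "$tmp/xml-dir/a.xml"

# Same search, two output formats (NULs shown as newlines).
full=$(cd "$tmp" && find xml-dir -mindepth 1 -print0 | tr '\0' '\n')
base=$(cd "$tmp" && find xml-dir -mindepth 1 -printf '%f\0' | tr '\0' '\n')

printf '%s\n' "-print0 gives: $full"   # xml-dir/a.xml
printf '%s\n' "%f gives:      $base"   # a.xml

rm -rf "$tmp"
```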

So what's up with the -mindepth 1 option?

This is another option that is not generally relevant but suits this use case. Without it, find also reports its starting point, xml-dir itself, and the name xml-dir would become one of the input strings. With -mindepth 1, only the entries inside the directory are printed.
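The effect is visible in a scratch directory. As before, this sketch assumes GNU find and uses made-up names; the output is newline-separated and sorted purely for display.

```shell
# Build a throwaway tree: <tmp>/xml-dir/a.xml
tmp=$(mktemp -d)
mkdir "$tmp/xml-dir"
touch "$tmp/xml-dir/a.xml"

# Without -mindepth 1 the starting directory itself is listed too.
without_opt=$(cd "$tmp" && find xml-dir -printf '%f\n' | sort)
with_opt=$(cd "$tmp" && find xml-dir -mindepth 1 -printf '%f\n' | sort)

printf 'without -mindepth 1:\n%s\n' "$without_opt"   # a.xml and xml-dir
printf 'with -mindepth 1:\n%s\n' "$with_opt"         # a.xml only

rm -rf "$tmp"
```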

I find xargs to be useful for solving two different classes of problems. In the first, I have a set of input strings, and I want to run a separate instance of a command for each of them. This is the class of problem illustrated above. In the second, I want to take all of the input strings and pass them together as arguments to a single command. Typically, when running a command, you write its arguments out literally on the command line. Often, however, the arguments come from another source, such as the output of another command or a file. Without a -n option, xargs reads arguments from standard input (or from a file, with the -a option) and packs as many of them as will fit into each invocation of the target command, which for a short list means a single invocation.
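The two modes can be compared side by side with a harmless target command. In this sketch, echo with a fixed "run:" prefix stands in for a real command, so each line of output corresponds to one invocation.

```shell
# Three input strings, handled in the two ways described above.
per_item=$(printf '%s\n' a b c | xargs -n 1 echo run:)   # one command per string
batched=$(printf '%s\n' a b c | xargs echo run:)         # one command for all

printf 'with -n 1:\n%s\n' "$per_item"
# run: a
# run: b
# run: c
printf 'without -n:\n%s\n' "$batched"
# run: a b c
```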

Some people argue that you can use shell command substitution to solve the same problem. Here's a way to run grep over a pre-specified list of files using command substitution:

grep 'some string' `cat files_to_search`

And here are two approaches to solving the same problem using xargs:

cat files_to_search | xargs grep 'some string' # or
xargs -a files_to_search grep 'some string'

The command substitution approach has the nice feature that it seems more natural: you place the command that produces the arguments in the same place that the arguments themselves would go. On the other hand, xargs gives you the ability to deal with whitespace in arguments (with -0, for example), and it will split the work across several invocations of the target command if the argument list is too long for one.
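Here is a sketch of the whitespace problem using a list file like the one above. The file names and the search string are invented, $(...) is used as the modern spelling of backticks, and the xargs call relies on the GNU-specific -d option to treat each line of the list as one argument.

```shell
# A scratch directory holding one file whose name contains a space,
# plus a list file that names it (one filename per line).
tmp=$(mktemp -d)
printf 'needle\n' > "$tmp/my notes.txt"
printf '%s\n' "$tmp/my notes.txt" > "$tmp/files_to_search"

# Command substitution: the shell splits the name at the space, so grep
# is asked to open two files that do not exist and finds nothing.
subst=$(grep -l needle $(cat "$tmp/files_to_search") 2>/dev/null) || true

# xargs with newline-delimited input (GNU -d) keeps the name in one piece.
via_xargs=$(xargs -d '\n' -a "$tmp/files_to_search" grep -l needle)

printf 'command substitution found: [%s]\n' "$subst"
printf 'xargs found:                [%s]\n' "$via_xargs"

expected_path="$tmp/my notes.txt"
rm -rf "$tmp"
```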

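If you want a demonstration, try this in a throwaway directory. Depending on your system's limit on argument-list length (getconf ARG_MAX reports it), the command-substitution form may fail with an “Argument list too long” error, while the xargs form splits the list across as many cat invocations as it needs: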

for ((i=0; i<100000; i++)); do touch "$i"; done
cat `ls` # alternatively, "cat *" is nearly equivalent
ls | xargs cat

The directory will now have one hundred thousand empty files in it, so you'll probably want to delete it when you've tried the above experiment.
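If you would rather not create any files, a similar effect can be observed by counting how many times xargs runs the target command. This is a sketch under the assumption that seq is available and that xargs applies a per-command size limit smaller than the input, as GNU xargs does by default; the exact number of invocations will vary between systems.

```shell
# seq produces 500000 short arguments, a few megabytes in total.
# Each invocation of echo prints exactly one line, so counting lines
# counts invocations, while counting words confirms nothing was lost.
invocations=$(seq 1 500000 | xargs echo | wc -l)
total_args=$(seq 1 500000 | xargs echo | wc -w)

echo "echo was run $invocations times for $total_args arguments"
```

GNU xargs can also report the limits it works within via its --show-limits option.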

There is one more feature of xargs that I'd like to highlight: the ability to run multiple commands in parallel using the -P option. If you are using xargs to run the same command template many times with different arguments, and your machine has multiple processors, adding -P N tells xargs to keep up to N of those commands running at once, which lets the operating system spread them across your processors. This makes a lot of sense for the first example in this article, and it is, in fact, what I did.
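As a rough illustration, the sketch below runs four one-second jobs with -P 4. The filenames are invented and "sleep 1" is only a stand-in for a slow per-file transformation such as the XSLT run. Because parallel jobs may finish in any order, the output is sorted before it is displayed.

```shell
# Four simulated jobs, up to four running at the same time.
start=$(date +%s)
out=$(printf '%s\0' a.xml b.xml c.xml d.xml |
  xargs -0 -n 1 -P 4 bash -c 'sleep 1; echo "done: $0"' | sort)
end=$(date +%s)

printf '%s\n' "$out"
# With -P 4 this should take roughly one second rather than four.
echo "elapsed: about $((end - start))s"
```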

The xargs command provides you with a lot of power for controlling command line execution with dynamically provided command line arguments. It's great for constructing one big command with all the arguments, or a set of individual commands each operating on one of the arguments. It helps you avoid some subtle errors, and it can easily provide significant speedup to large batch jobs by executing them in parallel. It fits in very well with a number of other commands in the GNU toolchain, such as find and bash. Keep it in mind when you're slinging arguments around!
