Infinitesque


Safe and powerful use of xargs with bash and find

Principal author:
John L. Clark

Abstract

This article provides a detailed examination of one use-case that combines xargs with bash and find, taking care to correctly handle strings that are passed between these utilities.

Say you have a directory of files, and you want to filter each of these files through some transformation and save the result to a new filename that is a function of the original filename. How do you do it? I found myself in just such a situation at work the other day, and I combined one of the unsung Unix power tools, xargs, which shines for this kind of problem, with two other command-line power tools: bash and find. The xargs command can take any number of input strings and distribute those strings as arguments to run another command one or more times; bash and find will hopefully be more familiar to you.

Note

This article leans heavily on the GNU versions of these utilities; many of the special features that I discuss are not available in non-GNU versions. Clearly, everyone should just install GNU software.

My problem was that I had a directory of input XML files with a certain naming convention, and I wanted to apply an XSLT transformation to each of them to convert them to RDF files, placing them in a second directory with a slightly different naming convention (.rdf instead of .xml). Here is the command line I (eventually) concocted:

find xml-dir -mindepth 1 -printf '%f\0' | \
  xargs -0 -n 1 bash -c \
  '4xslt -o "rdf-dir/${0%.xml}.rdf" "xml-dir/$0" transformation.xslt'

Let's walk through this step by step. The find command lists every entry in the XML directory; -printf '%f\0' prints just the base name of each entry followed by a NUL byte, and the -0 option tells xargs to split its input on NUL bytes, so names containing spaces, quotes, or newlines arrive intact. Because of the -n 1 option, xargs constructs one command line per input string. The command we run is bash, because we want to apply its filename-manipulation functionality (part of what bash offers with parameter expansion) to each input string. To do this, we use the -c option to give bash a command string, which it executes once xargs has constructed the full command line. The xargs command appends the next input string to each invocation of bash; because that string is the first argument after the command string, bash assigns it to the special parameter 0. We then expand $0 inside double-quoted strings so that each expansion reaches 4xslt as exactly one argument, and ${0%.xml} strips the trailing .xml so that .rdf can take its place. (The 4xslt command is an XSLT processor that uses the 4Suite engine.) It may help to see each command that xargs generates; you can instruct xargs to print them with its -t option.
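To watch the quoting and parameter expansion at work without running a real transformation, here is a minimal sketch that substitutes echo for 4xslt. The scratch directory and the two file names are invented for illustration, and the sketch assumes GNU find (for -printf) and an xargs that supports -0.

```shell
# Hypothetical scratch directory and file names, used only for illustration.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/xml-dir"
touch "$demo_dir/xml-dir/plain.xml" "$demo_dir/xml-dir/with space.xml"

# Same pipeline shape as the article's command, but echo stands in for 4xslt,
# so this only reports which input would map to which output.
planned=$(
  cd "$demo_dir" &&
  find xml-dir -mindepth 1 -printf '%f\0' |
    xargs -0 -n 1 bash -c 'echo "xml-dir/$0 -> rdf-dir/${0%.xml}.rdf"' |
    sort
)
printf '%s\n' "$planned"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

The name containing a space should come through as a single argument, which is the point of pairing '%f\0' with -0.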

I find xargs to be useful for solving two different classes of problems. In the first, I have a set of input strings, and I want to run a separate instance of a command for each of those input strings. This is the class of problem illustrated above. In the second, I want to take all of the input strings and add them as additional arguments to a single command. Typically, when running a command, you write out the arguments to that command literally in the command line. Often, however, the arguments come from another source, such as the output of another command or a file. Without a -n option, xargs packs as many input strings as will fit onto each command line, so the target command normally runs just once, with all of its arguments taken from standard input (or from a file, with the -a option).
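The difference between the two classes is easiest to see side by side. This small sketch feeds the same three made-up strings to xargs twice, once with -n 1 and once without; echo and the "run:" prefix are stand-ins chosen only to make each invocation visible.

```shell
# Three hypothetical input strings, NUL-terminated so the embedded space survives.
# With -n 1, xargs starts one echo per input string.
one_per_command=$(printf '%s\0' alpha 'beta gamma' delta | xargs -0 -n 1 echo run:)

# Without -n, xargs hands every input string to a single echo.
all_in_one=$(printf '%s\0' alpha 'beta gamma' delta | xargs -0 echo run:)

printf 'with -n 1:\n%s\n' "$one_per_command"
printf 'without -n:\n%s\n' "$all_in_one"
```

The first form prints one "run:" line per string; the second prints a single "run:" line carrying all of them.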

Some people argue that you can use shell command substitution to solve the same problem. Here's a way to run grep over a pre-specified list of files using command substitution:

grep 'some string' `cat files_to_search`

And here are two approaches to solving the same problem using xargs:

cat files_to_search | xargs grep 'some string' # or
xargs -a files_to_search grep 'some string'

The command substitution approach has the nice feature that it seems more natural; you are placing the command that will produce the arguments in the same place that the arguments themselves would go. On the other hand, xargs can cope with whitespace in arguments (by default it honors quotes and backslashes in its input, and with -0 or -d it splits only on the delimiter you choose), and it will split the work across several invocations of the target command if the argument list is too long for one.
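As a rough illustration of the whitespace point, the following sketch searches a single file whose name contains a space. The scratch directory and file name are hypothetical, and the -d and -a options are GNU xargs extensions.

```shell
# Hypothetical scratch directory holding one file whose name contains a space.
work=$(mktemp -d)
printf 'some string\n' > "$work/two words.txt"
printf '%s\n' "$work/two words.txt" > "$work/files_to_search"

# Command substitution: the shell splits the name at the space, so grep is
# handed two paths that do not exist and reports no matching file.
subst_found=$(grep -l 'some string' $(cat "$work/files_to_search") 2>/dev/null || true)

# GNU xargs, told that only newlines separate items, passes the name intact.
xargs_found=$(xargs -d '\n' -a "$work/files_to_search" grep -l 'some string')

printf 'command substitution found: [%s]\n' "$subst_found"
printf 'xargs found:                [%s]\n' "$xargs_found"

# Clean up the scratch directory.
rm -rf "$work"
```

Only the xargs form should report the file; the command substitution form never sees its full name.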

If you want a demonstration, try this in a throwaway directory:

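for ((i=0; i<100000; i++)); do touch "$i"; done
cat `ls` # alternatively, "cat *" is nearly equivalent
ls | xargs cat

If the combined length of the file names exceeds your system's limit (which you can check with getconf ARG_MAX), the command substitution form fails with an "Argument list too long" error, while xargs quietly splits the names across several invocations of cat. The directory will now have one hundred thousand empty files in it, so you'll probably want to delete it when you've tried the above experiment.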
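If you would rather not create that many files, the splitting behavior can also be observed on a small scale. This sketch uses the -s option to impose an artificially tiny limit on each command line; the value 200 and the use of seq as a source of arguments are arbitrary choices for illustration.

```shell
# seq stands in for a long list of file names; -s 200 is a deliberately tiny
# (hypothetical) limit on each command line so the splitting is easy to see.
invocations=$(seq 1 1000 | xargs -s 200 echo | wc -l)

# Every argument still reaches some echo command; none are dropped.
arguments=$(seq 1 1000 | xargs -s 200 echo | wc -w)

printf '%s arguments spread over %s echo commands\n' "$arguments" "$invocations"
```

Each output line corresponds to one echo invocation, so the line count shows how many commands xargs needed.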

There is one more feature of xargs that I'd like to highlight, and that is the ability to run multiple commands in parallel using the -P option. If you are using xargs to run the same command template many times using different arguments, and if you have multiple processors on your machine, then you can keep up to N of those commands running at once by adding -P N to the xargs command. This, for instance, would make a lot of sense with the first example in this article, and this is, in fact, what I did.
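A quick way to convince yourself that -P really overlaps the work is to time a few jobs that do nothing but wait. In this sketch, sleep stands in for a slow transformation, and the four one-second jobs and the value 4 for -P are arbitrary.

```shell
# Four hypothetical one-second jobs; sleep stands in for a slow transformation.
start=$(date +%s)
printf '%s\0' 1 1 1 1 | xargs -0 -n 1 -P 4 sleep
elapsed=$(( $(date +%s) - start ))

# Run one at a time, the jobs would need about four seconds in total.
printf 'four 1-second jobs took about %s second(s) with -P 4\n' "$elapsed"
```

In the first command of this article, the option would sit next to -n 1, as in xargs -0 -n 1 -P 4 bash -c '...', with 4 replaced by however many processes you want running at once.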

The xargs command provides you with a lot of power for controlling command line execution with dynamically provided command line arguments. It's great for constructing one big command with all the arguments, or a set of individual commands each operating on one of the arguments. It helps you avoid some subtle errors, and it can easily provide significant speedup to large batch jobs by executing them in parallel. It fits in very well with a number of other commands in the GNU toolchain, such as find and bash. Keep it in mind when you're slinging arguments around!

This page was last modified on 2008-08-29 00:00:00Z.


This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
