
Let's talk about xargs. It's useful, reliable, and far too inflexible.
#What is xargs?
For those who are unfamiliar, xargs is a POSIX program that's used to build
command lines. You produce some output (or supply a text file) and each line of that
output is substituted into a command line, which is then evaluated. For example, if I have
a list of URLs in a file foo.txt:

```
https://foo.com/bar
https://foo.com/baz
https://foo.com/bag
```

and I'd like to download all of them with curl, I can run:
```
cat foo.txt | xargs curl
```

For debugging, try running cat foo.txt | xargs echo curl; it should output:

```
curl https://foo.com/bar https://foo.com/baz https://foo.com/bag
```

xargs has taken each line of input, appended it to the end of our echo command, and
then run the command (which is why the command line is printed without an echo
prefix). Now under the hood xargs has a bunch of different (highly useful) options,
but this should give a high-level overview of what it does: read a file, build a
command line, run the command. It also automatically starts a new command whenever a
command line gets too long (which would otherwise make it fail with an "argument
list too long" error).
The main options of xargs that are worth mentioning include:

| Flag | Argument | Description |
|---|---|---|
| -r | | Don't run the command if the input supplies no arguments |
| -d | DELIMITER | Treat DELIMITER as the separator between arguments [1] |
| -I | PATTERN | Replace each occurrence of PATTERN in the command line with an argument [2] |
| -a | FILE | Read arguments from FILE instead of standard input |
[1]: By default xargs treats quotes and backslashes in its input specially, which
is why it's recommended to use NUL-delimited input (e.g. find -print0 paired with
xargs -0); however, with the -d flag set to \n you get the same predictable
behaviour with just regular newlines.
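For example, both of the following handle filenames containing spaces or quotes without mangling them (a quick sketch; grep -l TODO is just a stand-in subcommand):

```
# NUL-delimited: safe for any filename, even ones containing newlines
find . -type f -print0 | xargs -0 grep -l TODO

# newline-delimited: safe for spaces and quotes, though not embedded newlines
find . -type f | xargs -d '\n' grep -l TODO
```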
[2]: In the curl example above it's fine to just append arguments to the end
of the command, but if you need more flexibility you can position arguments anywhere
in it (e.g. xargs -I {} python3 {} --shared-flag). Note that the -I flag only
supports substituting one argument into the command line at a time.

#The Need for Parallel
So that's xargs; why do you need parallel?
#Exec vs. System
xargs always uses the exec system call to run a subcommand, meaning it can
only run a single program directly. So say you'd like to run multiple programs on the
input:
```
xargs -r -d '\n' -I{} -a urls.txt \
    curl -LO {} ';' echo "Finished downloading" {}
```

Here I've tried to curl a URL and then echo out that it has finished. If you
substituted {} for each line in urls.txt and then tried to run it in your shell
this would probably work... but not with xargs. Instead of running the two commands
curl -LO URL and echo "Finished downloading" URL, xargs runs a
single curl command with the arguments -LO URL ; echo "Finished downloading" URL.
There are ways to get around this. You can spawn a subshell directly:
```
xargs -r -d '\n' -I{} -a urls.txt \
    sh -c "curl -LO {}; echo 'Finished downloading' {}"
```

Now we can run multiple commands; of course there are clear issues with what to do
when our current argument {} needs to be quoted or contains an unescaped shell
expression. Not to mention the headache of building more complex command lines with
multiple levels of quoting.
Worse, if the program supplying your arguments includes something your shell can evaluate, the subshell will evaluate it with your user's permissions:
```
echo -e '`echo foobar; DO-SOMETHING-BAD`' |
    xargs -r -d '\n' -I{} \
    sh -c 'echo curl {}; echo "Finished downloading" {}'
```

This tries to download the URL foobar but lets the supplying program inject
essentially any arbitrary expression it wants.
parallel, on the other hand, has a few key differences. Firstly, parallel always
spawns a subshell and by default quotes all arguments. That means you can
include multiple expressions in the final command line and not have to worry about
quoting:
```
echo -e '`echo foobar; DO-SOMETHING-BAD`' |
    parallel curl {} \; echo "Finished downloading" {}
```

I'd argue readability alone justifies using parallel exclusively, but also note
that this doesn't run the embedded subcommand echo foobar; DO-SOMETHING-BAD; the
argument is passed, properly escaped, directly to curl and echo, avoiding a
potentially catastrophic injection vulnerability in your scripts.
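To see the quoting in action, here's a minimal sketch (the touch pwned payload is hypothetical): the backtick expression arrives as literal text rather than being executed:

```
echo '`touch pwned`' | parallel echo got: {}
# prints: got: `touch pwned`   (no file named "pwned" is created)
```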
#Field Index Expressions and Formatting
Even better, parallel lets you split input arguments into fields, like awk. For
example:

```
echo -e 'foo:bar\nbaz:bag\nbam:boom' |
    parallel --colsep ':' echo mv {1} {2}
```

Which outputs this:
```
mv bam boom
mv baz bag
mv foo bar
```

Each input line is split using the --colsep argument, and each field can then be
accessed using {N} for the Nth field, starting from 1. {} and {0} contain the
entire input expression.
parallel also has a few other neat substitution patterns for dealing
with files and URLs specifically:
| Pattern | Description |
|---|---|
| {.} | The argument without the file extension |
| {/} | The basename of the argument |
| {//} | The dirname of the argument |
| {/.} | The basename of the argument without the extension |
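As a quick sketch of how these expand, given a single hypothetical path:

```
echo /tmp/src/photo.jpg | parallel echo {} {.} {/} {//} {/.}
# outputs: /tmp/src/photo.jpg /tmp/src/photo photo.jpg /tmp/src photo
```

These make renaming-style jobs pleasantly terse, e.g. something like parallel convert {} {.}.png ::: *.jpg to convert every JPEG in a directory.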
#Concurrent Execution
It isn't called parallel without a good reason (・◇・). Both xargs and
parallel support running multiple commands simultaneously. Once a command line is
built, the subcommand is run and xargs/parallel can move on to building the next
command line, until a configurable maximum number of subcommands are running at the
same time. As one command exits the next one can begin, until no more need to be
created.
By default xargs runs only one subcommand at a time, while parallel runs as many
commands as you have CPU cores. You can configure how many you want running
concurrently with the --max-procs flag.
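For instance, to keep at most four downloads running at once (a quick sketch reusing the urls.txt file from earlier; both GNU xargs and parallel accept --max-procs, also spelled -P):

```
# xargs: at most four concurrent curl processes, one URL each
xargs -r -d '\n' --max-procs 4 -n 1 -a urls.txt curl -LO

# parallel: the same, reading arguments from the file with ::::
parallel --max-procs 4 curl -LO :::: urls.txt
```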
In practice the only real difference between xargs and parallel here (aside from
their respective default behaviour) is how they treat program output.
xargs doesn't. Whatever the subcommands print is written to the current
stdout, even when multiple subcommands try to do so simultaneously. This means you're
unlikely to get useful output when running xargs with multiple processes.
parallel is a little friendlier. The default behaviour is to wait until a
program finishes and then write its output. This is slow and not great for
long-running commands (for which you'd like to get progress updates).
You can pass the --line-buffer flag to make parallel output each line from any
subcommand as it's printed, preventing cluttered output and maximising usability.
The --ungroup flag makes parallel treat subcommand output the same as xargs does.
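Here's a small sketch you can run to compare the three modes and watch how the output interleaves (or doesn't):

```
# grouped (default): a job's output is held until the job finishes
seq 3 | parallel 'echo start {}; sleep {}; echo done {}'

# line-buffered: lines appear as they're printed, never split mid-line
seq 3 | parallel --line-buffer 'echo start {}; sleep {}; echo done {}'

# ungrouped: raw xargs-style output that may interleave arbitrarily
seq 3 | parallel --ungroup 'echo start {}; sleep {}; echo done {}'
```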
#Remote Execution
Honestly this feature is probably overkill... but of course I have to talk about it.
parallel lets you outsource the processing of a subcommand to another host.
Remote execution, right from your tiny little computer. All right, to explain
what this means, consider for a second that you have an army of tiny computers (e.g.
Raspberry Pis). Each computer is its own host with its own operating system and
resources. When dealing with file commands, parallel lets you send the file to one of
these hosts, run the subcommand there, and then copy back the output (stdout and
maybe files as well) to your localhost.
So say you've got 10,000 little folders you'd like to compress, each into its own
archive (foo/ -> foo.tar.gz, bar/ -> bar.tar.gz). Doing that on your own
machine might take a while and hog its resources, but with parallel you can
share the computational cost across a network of computers.

```
find -mindepth 1 -maxdepth 1 -type d |
    parallel --sshlogin hostname --transferfile {} --return {/}.tar.gz --cleanup \
    tar -czvf {/}.tar.gz {}
```

Now in practice this is only worthwhile for really massive computational problems,
and even then the time lost copying files to and from a remote host might outweigh
whatever benefit you get from it... but damn, that's cool!
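You aren't limited to a single remote either: --sshlogin takes a comma-separated list, and the special login ":" means "also run jobs locally". A hypothetical sketch with two Pis named pi1 and pi2 (assuming passwordless SSH to both):

```
# spread jobs across pi1, pi2, and the local machine
seq 9 | parallel --sshlogin pi1,pi2,: 'hostname; echo processed {}'
```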
#Why Use Parallel?
In the end though, if not for all those features above, what actual benefit does
parallel bring? In my opinion... it's that it's mostly compatible with xargs. That
means you can parallelize programs you'd normally run sequentially in a way that
works with parallel but doesn't depend on it.
For example, I often use a script to find a bunch of repositories on my
system. I specify a list of places I want the script to look in using a REPO_PATH
environment variable, then run a find command on all of them asserting some
condition to confirm whether each folder is a repository or not. So if I had
REPO_PATH=$HOME/foo:$HOME/bar, my script would check all the folders inside
~/foo and then all the folders in ~/bar to see if they're repositories.
The clear issue with this is that find doesn't have a command-line interface that
suits xargs. You can only add paths to find at the start of the command (e.g. find
PATH1 PATH2 PATH3 -options). So the classic approach here would be:

```
cat paths.txt |
    xargs -r -d '\n' -I{} find {} -options
```

So each path gets its own find process, which can result in a lot of find processes
that don't seem necessary (NOTE: -I and --max-args are mutually exclusive, so
you're forced to substitute only a single argument at a time). With parallel,
however, you can substitute as many directories for {} as you desire (using the -N
flag):
```
cat paths.txt |
    parallel -N 100 find {} -options
```

This will spawn only one find command when the number of paths is less than 100
(and fits into the maximum length of a single command line). Compare this to the
xargs equivalent spawning 100 find processes (one for each path).
#Replacing Xargs
Observe that in the previous section there's no need to use exclusively parallel or
xargs. Because the two mostly share a common interface, you can swap xargs for
parallel in most situations. I do this by checking at the start of my scripts
whether parallel is available and using it if so (otherwise defaulting to xargs).
```
batch=( )
if command -v parallel >/dev/null 2>&1; then
    batch+=( parallel -d '\n' -r -X --line-buffer --quote -N 100 )
else
    # -I{} so the xargs fallback also substitutes {} (one argument per command)
    batch+=( xargs -r -d '\n' -I{} )
fi
```

Now whenever I want to run a command like xargs I instead use:
"${batch[@]}" find {} -printThis is a mild performance boost we'll probably never notice but it's nice to know
it's there. Rather than inefficiently spawning find processes left, right, and
centre, we now only spawn as many as we need 😄.
#Conclusion
This has been a brief but informative look at GNU parallel. Let me know if
there are any other interesting applications or features of parallel that you've
found. For those interested in learning more about parallel, check out the manual.