A brief introduction to shell scripting. Please see /work/projects/BioITeam/common/scripts for a number of well-written scripts.
What is shell scripting, and why use it?
A shell is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems use the Bourne Again SHell (bash), but a typical Linux system also offers several other shells such as ksh, tcsh, and zsh. A quick way to check which shell you are using is to type a few random letters and hit Enter: the resulting "command not found" error message usually begins with the shell's name. For example, Lonestar at TACC uses bash.
A shell script is a series of commands written in a plain text file. Instead of entering commands one by one, you can store the sequence of commands in a text file and tell the shell to execute that file. When you need to run the same series of commands on multiple datasets, a shell script can automate the task and save a lot of time.
Exercise 1 - Hello world
Below is a simple shell script that takes one argument (the text to print after "Hello") and echoes it.
- The first line tells the shell which program to use to execute this file (here, the bash program).
- The second line sets the shell variable TEXT to the first command line argument.
- The third line defaults the value of TEXT to the string "Shell World" if no command line argument is provided.
- Remaining lines echo some text, substituting the value passed in on the command line.
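A script matching this description might look like the sketch below (the exact greeting text in the original hello.sh may differ):

```shell
#!/bin/bash
TEXT=$1                      # 1st command line argument (may be empty)
TEXT=${TEXT:-"Shell World"}  # default value if no argument was provided
echo "Hello, $TEXT!"
```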
Open your favorite text editor, enter these lines, and save as hello.sh (note the file extension for shell scripts is .sh). Then open a Terminal window and change into the directory where the script was saved. For example:
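For instance, if you saved it in a scripts directory under your home directory (this path is just an example):

```shell
cd ~/scripts
```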
The script can be run, with or without command line arguments, by explicitly invoking bash as follows:
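Here bash is given the script file name as its argument, with or without an argument for the script itself:

```shell
bash hello.sh
bash hello.sh "Expert scripter"
```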
There is a shortcut, though. Since we have the line at the top of this file that names the program that should run it, we should be able to execute the script just by typing in its pathname like this (where ./ means current directory):
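That is, just:

```shell
./hello.sh
```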
But there's a complication. Welcome to the world of Unix permissions! The script file must be marked executable for this to work. To see what the current permissions are:
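The owner, group, size and date shown below are illustrative; the permission string on the left is what matters:

```shell
ls -l hello.sh
# -rw-rw-r-- 1 you yourgroup 95 Jan 1 12:00 hello.sh
```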
This says that anyone can read the file, the owner (you) or anyone in your group can modify it (write permission), but no one can execute it. We use the chmod program to allow anyone to execute the script:
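Here a+x means add (+) execute permission (x) for all users (a):

```shell
chmod a+x hello.sh
```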
Now hello.sh can be invoked directly:
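With or without an argument:

```shell
./hello.sh
./hello.sh "Expert scripter"
```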
Note that when we supplied the text "Expert scripter", we put it in quotes, which group the two words into one argument to the script. Without the quotes, the word "Expert" would be seen by the script as argument 1 and "scripter" would be seen as argument 2 (which our script ignores).
BWA alignment script
The first real script you will likely find yourself wanting is one that performs a standard set of alignment tasks: mapping, .bam file creation, and statistics reporting. The script we want for the bwa aligner would do the following:
- Aligns a fastq file to a pre-made reference genome
- Extracts alignments from bwa's proprietary binary .sai file to a .sam file
- Converts the .sam file into a .bam file using samtools
- Sorts and indexes the .bam file so that it can be viewed in IGV.
- Counts the number of aligned and unaligned reads, and calculates the mapping rate.
Since we want to use this script on different datasets, it should take some arguments on the command line telling it what to work on. Let's have it take the following arguments:
- Name of the input fastq file (or the R1 file if paired).
- A prefix to use when writing output files (e.g. <prefix>.bam).
- Name of a reference genome to use. The script will find the appropriate reference index based on this value.
- A flag indicating whether single end or paired end alignment should be done. 0 = single, 1 = paired.
Here is a completed Example BWA alignment script. You may want to open it in a separate window so you can read along as it is discussed here. It is also available in the course materials as align_bwa.sh.
What could possibly go wrong?
The first thing you will notice about this script is that there is a lot of argument and error checking -- more than the actual "work" code! This is a hallmark of a well-written shell script, especially one that will be run at TACC by many processors at a time.
Let's say you run 20 alignments in parallel at TACC. How do you know if they all completed successfully? If some did not, which ones? And how do you tell what went wrong? Do you really want to poke around in 20 directories/files to figure it out?
The approach this script takes to error checking reflects the fact that many, many things can go wrong. This comes from experience: every error check in this script guards against something that has actually gone wrong for us in the past :)
Shell scripts can define functions, which are a convenient way to avoid repeating the same few lines of code again and again. This philosophy of code writing is called DRY, for Don't Repeat Yourself. Like shell scripts themselves, shell script functions take their arguments on the line that invokes them, and refer to them as $1, $2, etc.
Let's look at the simplest function in the align_bwa.sh script:
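This sketch is based on the description that follows; the exact message formatting in align_bwa.sh may differ slightly:

```shell
# err - echo the given message plus boilerplate text, then exit the script
err() {
    echo "$1...exiting"
    exit 1
}
```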
This function takes one argument, echoes it along with some boilerplate text ("...exiting") and exits. You might call it like this from a shell script:
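For example (the message text here is arbitrary):

```shell
err "Required fastq file argument missing"
```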
Why write a function this simple? Because we're going to build on it, writing more specialized error checking functions, and we want all of them to print out the boilerplate text ("...exiting") before they exit. If we ever want to change this boilerplate text we only have to change it in one function! And, since the text is well-known and not likely to be written by successful programs, we can easily grep for it in our execution log files. For a single file:
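Assuming the job's output was captured in a log file (the file name here is hypothetical):

```shell
grep 'exiting' my_alignment.log
```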
Or even better, for any log file in any subdirectory of the current directory:
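One way is to let find locate the log files:

```shell
find . -name '*.log' | xargs grep 'exiting'
```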
Here is a more specialized ckFile function that checks for the existence of the file name passed as its first argument:
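A sketch of such a function, relying on the err function described earlier (the message wording is illustrative):

```shell
# ckFile - check that the file named by $1 exists;
# $2 (optional) describes the file in the error message
ckFile() {
    if [ ! -e "$1" ]; then
        err "$2 File '$1' not found"
    fi
}
```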
If the file name passed as the first argument exists, nothing happens when ckFile is called. If the file does not exist, the shell script exits at the line where ckFile was called, after printing out a diagnostic message that includes our boilerplate (because this function calls err).
The function can be called with one or two arguments, for example:
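For example (the variable name is a placeholder; the second argument just labels the file in any error message):

```shell
ckFile "$FASTQ_FILE"
ckFile "$FASTQ_FILE" "Input fastq"
```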
Here is another function, ckRes, that checks the result code passed in as its first argument. It uses the text passed as its second argument either to print a diagnostic message (by calling our friend err) or to print a message showing that the task completed, and when:
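A sketch consistent with that description (the exact message wording is a guess):

```shell
# ckRes - check the result code in $1; $2 names the task for the message
ckRes() {
    if [ "$1" -eq 0 ]; then
        echo "..Done $2 `date`"
    else
        err "$2 returned non-0 exit code $1"
    fi
}
```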
A time-honored convention is that all programs, whether shell scripts, built-in shell commands, user-written scripts or other programs, exit with a return code of 0 if all went well, or with some other integer return code if not. Calling programs can then check the return code to see if something went wrong. In the bash shell, the just-executed program's return code is placed in the special $? variable, which should be checked right away because executing any other command will reset it. So, for example, to check whether a call to bwa aln returned 0 (ok, keep going) or not (bad, exit with a message):
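For example (the reference prefix and file names are placeholders):

```shell
bwa aln $REF_PREFIX $FASTQ_FILE > $OUT_PREFIX.sai
ckRes $? "bwa aln"
```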
If the alignment was successful, a message like this will be written to the execution log:
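With a completion message like the one just described (task name plus the date), the log line might look like this (wording and date are purely illustrative):

```
..Done bwa aln Wed May 22 10:14:31 CDT 2013
```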
If the program's return code was non-zero, a message like this will be written, and the script will terminate.
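Again with illustrative wording, including the grep-able boilerplate:

```
bwa aln returned non-0 exit code 1...exiting
```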
And yet one more wrinkle. For a further refinement of file checking, we also check that the file's size is non-0. Why? Because:
- Programs don't always return a non-0 return code. For example, if called with no arguments just to show usage they often return 0 (after all, there was no error, even if nothing was done). Even well-written programs sometimes neglect to return non-0 exit codes in some circumstances.
- Sometimes a program (or the shell) creates an empty output file before doing anything else. So if the program doesn't return a non-0 exit code, you may think it ran fine because the output file exists, even though the file is empty. This happens often enough to warrant a special check.
Here's a ckFileSz function that accomplishes this goal, building on the ckFile and err functions:
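A sketch of ckFileSz consistent with the description that follows (message wording is illustrative):

```shell
# ckFileSz - check that the file named by $1 exists and has non-0 size;
# $2 (optional) describes the file in the error message
ckFileSz() {
    ckFile "$1" "$2"
    SZ=`ls -l "$1" | awk '{print $5}'`
    if [ "$SZ" = "0" ]; then
        err "$2 File '$1' is zero length"
    fi
}
```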
It first calls ckFile to see if the file exists; only if it does are the further statements executed. The file size check is performed by piping the result of ls -l <file> to an awk script that just echoes the file size part of the line (field 5). This chained command is executed by putting it in back quotes, and its result is stored in the SZ variable, which is then checked to see if it is the string "0".
Of course a program could still produce a file with a non-0 size then error with a non-0 exit code. How might you address this possibility?
OK, enough of boring (but necessary!) error checking. Onward to more interesting things!
Another time-honored convention is to provide your users with information on how to run the program. The first thing our script does, after capturing its first 4 command line arguments in variables, is to check whether the last required argument (PAIRED, the 4th argument) is empty, and if so, print detailed usage information. So when someone doesn't know or remember what arguments the script takes (perhaps you, 6 months from now), they can just invoke the script with no arguments to find out:
A good shell script should also be relatively easy to call. That's why, for example, this script takes only a short name for the desired reference and uses it to select the correct index path, and requires only the name of the R1 fastq file for paired-end reads, using that name to determine the name of the R2 fastq file. While we won't go into the details of that defaulting here, and these specific choices may not be appropriate for your environment, you might want to look at those parts of the script for ideas on how to accomplish similar goals.
Finally, the last few lines of the script should declare success in a way that can be grep'd for. Ours uses this boilerplate text:
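We don't reproduce the script's exact phrase here; any distinctive final line works, for example:

```shell
echo "align_bwa.sh script done `date`"
```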
We can check that all of our scripts have done their proper work using something like this:
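For example, counting the log files that contain the success phrase (adjust the grep text to match your own boilerplate):

```shell
find . -name '*.log' | xargs grep -l 'script done' | wc -l
```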
This will print the number of log files that have the magic success words, and we can compare that number against the number of scripts we actually ran.
The real work!
After the first part of align_bwa.sh has performed some initial error checks and established the execution environment, the script sets about doing the real work. For example, when doing a single-end alignment, it makes a call to bwa aln, passing the pathname prefix for the indexed reference genome files and the input fastq file name, then redirecting the output (which normally goes to standard output) to a .sai file named using the output prefix specified by the user. We then use our "belts and suspenders" approach to error checking to make sure all went well.
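For single-end alignment, the aln step plus its checks might look like this sketch (file names are placeholders; err, ckRes and ckFileSz are the functions discussed above):

```shell
bwa aln $REF_PREFIX $FASTQ_FILE > $OUT_PREFIX.sai 2>$OUT_PREFIX.aln.log
ckRes $? "bwa aln"
ckFileSz "$OUT_PREFIX.sai" "bwa aln .sai"
```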
Note that .sai is a proprietary binary format used by bwa. Most aligners have some equivalent "intermediate" format that can then be translated in to a .sam or .bam file. For bwa, the command to extract alignments is samse (single end alignment) or sampe (paired end alignment). Here's what the single-end call looks like:
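A single-end extraction sketch, combining the samse call with the .bam conversion described in the next paragraphs (names are placeholders):

```shell
bwa samse -r "$RG" $REF_PREFIX $OUT_PREFIX.sai $FASTQ_FILE | samtools view -b -S - > $OUT_PREFIX.bam
ckRes $? "bwa samse"
ckFileSz "$OUT_PREFIX.bam" "bwa samse .bam"
```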
The call to bwa samse requires the same pathname prefix for the indexed reference genome files and input fastq file name passed to bwa aln. It also takes the .sai binary alignment file name. In addition, we provide read group information (the -r "$RG" option) which will be stored in the .bam header (see the script comments for more information).
Since we want a binary .bam as output, but bwa samse (and sampe) produce .sam text output, we pipe the .sam file output to samtools view to convert it to .bam output, which is then redirected to an output file named using the user's output prefix. This command chaining or "piping" avoids having to write then read an intermediate .sam file. Note the dash on the samtools view -b -S - command line means samtools should look for its input data on standard input instead of in a file.
When aligning paired-end reads, bwa aligns each set of read ends independently, then uses pairing information when the alignments are extracted (for example, to compute the insert size between reads where both ends aligned). So the call to bwa sampe in our script takes arguments for fastq and .sai files for each end.
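The paired-end sketch then looks like this (the .sai and R2 fastq names would be derived by the script; all names here are placeholders):

```shell
bwa sampe -r "$RG" $REF_PREFIX $OUT_PREFIX.R1.sai $OUT_PREFIX.R2.sai $FASTQ_R1 $FASTQ_R2 | samtools view -b -S - > $OUT_PREFIX.bam
```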
At this point the .sam/.bam file produced has a header, and then one line for each read end that was processed. Read pairs are listed one after the other, in the same name order as the input fastq file: this is referred to as read name ordering. While useful for some applications, most downstream tools (such as the IGV visualization program) require a .bam that is sorted by location (location ordered). A location consists of a contig name, as defined in the original .fasta file used to generate the reference index (e.g. chr14), and a start position. The names of the contigs and their lengths are kept in the .sam/.bam header, which is why the header is required for sorting.
The actual bam sorting and indexing are straightforward calls to samtools (although you might want to check out the -m maximum memory option for samtools sort; it can speed up sorting of large files considerably):
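Using the older samtools calling convention from this era, where sort takes an output file name prefix (newer samtools versions use -o for the output file instead):

```shell
samtools sort $OUT_PREFIX.bam $OUT_PREFIX.sorted
ckRes $? "samtools sort"
samtools index $OUT_PREFIX.sorted.bam
ckRes $? "samtools index"
```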
Finally, we call samtools flagstat to report alignment statistics:
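Here tee writes the report both to the screen (or execution log) and to a file for later summarizing (the .flagstat.txt suffix is just a convention we assume here):

```shell
samtools flagstat $OUT_PREFIX.sorted.bam | tee $OUT_PREFIX.flagstat.txt
```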
To summarize the statistics from all your (possibly parallel) alignments, you could do something like this:
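Each flagstat report contains a line with the word "mapped" and the mapping percentage, so (assuming the reports were saved with a consistent file name suffix as above) you could pull those lines from all of them:

```shell
find . -name '*.flagstat.txt' | xargs grep 'mapped ('
```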