This page should serve as a reference for the many "things Linux" we use in this course.

Terminal programs

You need a Terminal program in order to ssh to a remote computer.

Getting around in the shell

Important keyboard shortcuts

Type as little and as accurately as possible by using keyboard shortcuts!

Tab key completion

The Tab key is your best friend! Hit the Tab key once or twice - it's almost always magic! Hitting Tab invokes shell completion, instructing the shell to try to guess what you're doing and finish the typing for you. On most modern Linux shells, Tab completion will:

Arrow keys

Command line editing

Wildcards and special file names

The shell has shorthand to refer to groups of files by allowing wildcards in file names.

* (asterisk) is the most common filename wildcard. It matches "any length of any characters".

This technique is sometimes called filename globbing, and the pattern a glob.

Other useful ones are

For example:

Three special file names:

  1. . (single period) means "this directory".
  2. .. (two periods) means "directory above current." So ls .. means "list contents of the parent directory."
  3. ~ (tilde) means "my home directory".
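A minimal sketch of the asterisk wildcard and the special names above, run in a throwaway directory. The file and directory names (reads_1.fq, notes.txt, sub) are made up for illustration.

```shell
# Work in a temporary directory so nothing real is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch reads_1.fq reads_2.fq notes.txt

# *.fq matches every name ending in .fq; notes.txt is not matched
fq_files=$(echo *.fq)
echo "$fq_files"

# .. refers to the parent directory, so from inside sub
# "ls .." lists the contents of demo_dir
mkdir sub
cd sub
parent_listing=$(ls ..)
echo "$parent_listing"
```

The shell expands the glob before the command runs, so echo (or ls, or any other program) only ever sees the matching file names.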

While it is possible to create file and directory names that have embedded spaces, that creates problems when manipulating them.

To avoid headaches, it is best not to create file/directory names with embedded spaces.
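To see the kind of problem an embedded space causes, here is a small sketch using a made-up file name ("my file.txt"). Without quotes the shell splits the name into two separate arguments.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch "my file.txt"

# Unquoted: ls is asked for two files, "my" and "file.txt", and neither exists
if ls my file.txt 2>/dev/null; then unquoted_ok=yes; else unquoted_ok=no; fi

# Quoted: the whole name is passed as a single argument
if ls "my file.txt"; then quoted_ok=yes; else quoted_ok=no; fi

echo "unquoted worked: $unquoted_ok, quoted worked: $quoted_ok"
```

If you do have to work with such names, quote them every time, including when they are stored in a variable ("$fname").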

Standard streams

Every command and Linux program has three "built-in" streams: standard input, standard output and standard error.

It is easy to not notice the difference between standard output and standard error when you're in an interactive Terminal session – because both outputs are sent to the Terminal. But they are separate streams, with different meanings. When running batch programs and scripts you will want to manipulate standard output and standard error from programs appropriately.

redirecting output

To see the difference between standard output and standard error try these commands:

# redirect a long listing of your $HOME directory to a file
ls -la $HOME > cmd.out
# look at the contents -- you'll see just files
cat cmd.out

# this command gives an error because the target does not exist
ls -la bad_directory

# redirect any errors from ls to a file
ls -la bad_directory 2> cmd.out
# look at the contents -- you'll see an error message
cat cmd.out

# now redirect both error and output streams to the same place
ls -la bad_directory $HOME > cmd.out 2>&1
# look at the contents -- you'll see both an error message and files
cat cmd.out


The power of the Linux command line is due in no small part to the power of piping. The pipe symbol ( | ) connects one program's standard output to the next program's standard input.

A simple example is piping uncompressed data "on the fly" to a pager like more (or less):

# zcat is like cat, except that it understands the gz compressed format,
# and uncompresses the data before writing it to standard output.
# So, like cat, you need to be sure to pipe the output to a pager if
# the file is large.
zcat big.fq.gz | more

# Another way to do the same thing is to use gunzip and provide the -c option,
# which says to write decompressed data to the stdout (-c for "console")
gunzip -c big.fq.gz | more

piping a histogram

But the real power of piping comes when you stitch together a string of commands with pipes – it's incredibly flexible, and fun once you get the hang of it.

For example, here's a simple way to make a histogram of mapping quality values from a subset of BAM file records.

# create a histogram of mapping quality scores for the 1st 1000 mapped bam records
samtools view -F 0x4 small.bam | head -1000 | cut -f 5 | sort -n | uniq -c
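If you don't have a BAM file handy, the counting part of that pipeline can be tried on its own. This sketch uses made-up mapping quality values in place of the output of cut -f 5; the final awk step is an addition that only tidies the spacing uniq -c produces.

```shell
# Made-up mapping quality values, standing in for column 5 of the BAM records
mapq_values="60 0 60 37 60 0"

# sort -n groups equal values together; uniq -c then counts each run.
# uniq only collapses adjacent duplicates, which is why the sort comes first.
histogram=$(printf '%s\n' $mapq_values | sort -n | uniq -c | awk '{print $2, $1}')

# Each output line is "<value> <count>"
echo "$histogram"
```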

Environment variables

Environment variables are just like variables in a programming language (in fact, bash is a complete programming language): they are "pointers" that reference the data assigned to them. In bash, you assign an environment variable as shown below:

export varname="Some value, here it's a string"

Careful – do not put spaces around the equals sign when assigning environment variable values.

Also, always use double quotes if your value contains (or might contain) spaces.

You set environment variables using the bare name (varname above).

You then refer to or evaluate an environment variable using a dollar sign ( $ ) before the name:

echo $varname

The export keyword when you're setting ensures that any sub-processes that are invoked will inherit this value. Without the export only the current shell process will have that variable set.
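The effect of export can be checked directly. In this sketch the variable names (local_only, exported_var) are hypothetical, and sh -c is used simply as a convenient way to start a sub-process.

```shell
# Start from a clean slate for the two demonstration variables
unset local_only exported_var

local_only="visible only here"
export exported_var="visible in child processes"

# Single quotes stop the current shell from expanding the variables,
# so it is the child process that looks them up.
# ${name:-unset} prints "unset" when the variable has no value.
child_sees_local=$(sh -c 'echo "${local_only:-unset}"')
child_sees_exported=$(sh -c 'echo "${exported_var:-unset}"')

echo "local_only in child:   $child_sees_local"
echo "exported_var in child: $child_sees_exported"
```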

Use the env command to see all the environment variables you currently have set.

Quoting in the shell

What the different quote marks mean in the shell, and when to use each, can be quite confusing.

There are three types of quoting in the shell:

  1. single quoting (e.g. 'some text') – this serves two purposes
  2. double quoting (e.g. "some text") – also serves two purposes
  3. backtick quoting (e.g. `date`)
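A short sketch of how the three kinds of quoting behave. The variable names are made up, and basename is used only because its output is predictable.

```shell
name="world"

# Single quotes: the text is kept as one item and nothing inside is expanded
single='hello $name'

# Double quotes: the text is kept as one item, but $name is expanded
double="hello $name"

# Backticks: the command inside is run and replaced by its output
from_backticks=`basename /some/dir/file.txt`

# $( ) does the same job as backticks and is easier to nest
from_dollar_paren=$(basename /some/dir/file.txt)

echo "$single"
echo "$double"
echo "$from_backticks $from_dollar_paren"
```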

Using Commands

Command options

Sitting at the computer, you should have some idea what you need to do. There's probably a command to do it. If you have some idea what it starts with, you can type a few characters and hit Tab twice to get some help. If you have no idea, you Google it or ask someone else.

Once you know a basic command, you'll soon want it to do a bit more - like seeing the sizes of files in addition to their names.

Most built-in commands in Linux use a common syntax to ask more of a command. They usually add a dash ( - ) followed by a code letter that names the added function. These "command line switches" are called options.

Options are, well, optional – you only add them when you need them. The parts of the command line after the options, like filenames, are called arguments. Arguments can also be optional, but you can tell them from options because they don't start with a dash.

# long listing option (-l)
ls -l

# long listing (-l), all files (-a) and human readable file sizes (-h) options. $HOME is an argument (directory name)
ls -l -a -h $HOME

# sort by modification time (-t) displaying a long listing (-l) that includes the date and time
ls -lt

Almost all built-in Linux commands, and especially NGS tools, use options heavily.

Like dialects in a language, there are at least three basic schemes commands/programs accept options in:

  1. Single-letter short options, which start with a single dash ( - ) and can often be combined, like:

    head -20 # show 1st 20 lines
    ls -lhtS (equivalent to ls -l -h -t -S)
  2. Long options use the convention that double dashes ( -- ) precede the multi-character option name, and they can never be combined. Strictly speaking, long options should be separated from their values by the equals sign ( = ); this is a GNU convention, since the POSIX guidelines only describe short options. But most programs let you use a space as separator also. Here's an example using the mira genome assembler:

    mira --project=ct --job=denovo,genome,accurate,454 -SK:not=8
  3. Word options, illustrated in the GATK command line to call SNPs below.

java -Xms512m -Xmx4g -jar /work2/projects/BioITeam/common/opt/GenomeAnalysisTK.jar -glm BOTH -R $reference -T UnifiedGenotyper -I $outprefix.realigned.recal.bam --dbsnp $dbsnp -o $outprefix.snps.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance

Getting help

So you've noticed that options can be complicated – not to mention program arguments. Some options have values and others don't. Some are short, others long. How do you figure out what kinds of functions a command (or NGS tool) offers? You need help!

--help option

Many (but not all) built-in shell commands will give you some help if you provide the long --help option. This can often be many pages, so you'll probably want to pipe the output to a pager like more. This is most useful to remind yourself what the name of that dang option was, assuming you know something about it.

-h or -? options

The -h and -? options are similar to --help. If --help doesn't work, try -h or -?. Again, output can be lengthy and is best used if you already have an idea what the program does.

just type the program name

Many 3rd party tools will provide extensive usage information if you just type the program name then hit Enter.

For example:

bwa

Produces something like this:

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.16a-r1181
Contact: Heng Li <>

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

Notice that bwa, like many NGS programs, is written as a set of sub-commands. This top-level help displays the sub-commands available. You then type bwa <command> to see help for the sub-command:

bwa index

Displays something like this:

Usage:   bwa index [options] <in.fasta>

Options: -a STR    BWT construction algorithm: bwtsw or is [auto]
         -p STR    prefix of the index [same as fasta name]
         -b INT    block size for the bwtsw algorithm (effective with -a bwtsw) [10000000]
         -6        index files named as <in.fasta>.64.* instead of <in.fasta>.*

Warning: `-a bwtsw' does not work for short genomes, while `-a is' and


If you don't already know much about a command (or NGS tool), just Google it! Try something like "bwa manual" or "rsync man page". Many tools have websites that combine tool overviews with detailed option help. Even for built-in Linux commands, you're likely to get hits of a tutorial style, which are more useful when you're getting started.

And it's so much easier to read things in a nice web browser!

man pages

Linux had built-in help files way before Macs or PCs thought of such things. They're called man pages (short for manual).

For example, man intro will give you an introduction to all user commands.

man pages will detail all options available – in excruciating detail (unless there's no man page), so the manual system has its own built-in pager. The pager is sort of like less, but not quite the same (why make it easy?). We recommend man pages only for advanced users.

Basic Linux commands you need to know

Here's a Linux commands cheat sheet. You may want to print a copy.

And here's a set of commands you should know, by category (under construction).

Most built-in Linux commands that obtain data from command line arguments (such as file names) can also accept the data piped in on their standard input.
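As a quick illustration, grep -c (count matching lines) gives the same answer whether the data comes from a file named as an argument or from standard input. The temporary file and its contents are made up for this sketch.

```shell
demo_file=$(mktemp)
printf 'line one\nline two\nline three\n' > "$demo_file"

# Data source given as a command line argument
count_from_arg=$(grep -c 'line' "$demo_file")

# Same data piped in on standard input
count_from_stdin=$(cat "$demo_file" | grep -c 'line')

echo "from argument: $count_from_arg, from stdin: $count_from_stdin"
rm -f "$demo_file"
```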

File system navigation

Create, rename, link to, delete files

Displaying file contents

Copying files and directories

Miscellaneous commands

Advanced commands

cut, sort, uniq, grep, awk

cut versus awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:
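Two differences that are easy to demonstrate are how each tool splits fields and whether fields can be reordered. The sketch below uses made-up input; it is an illustration, not a complete comparison.

```shell
# Two fields separated by a run of four spaces
line="chr1    100"

# cut treats every single delimiter character as a field boundary,
# so with a space delimiter, field 2 here is empty
cut_field=$(echo "$line" | cut -d ' ' -f 2)

# awk by default splits on runs of whitespace, so field 2 is the number
awk_field=$(echo "$line" | awk '{print $2}')

# awk can print fields in any order
awk_order=$(printf 'a\tb\tc\n' | awk -F '\t' '{print $3, $1}')

# cut always emits the selected fields in their input order,
# regardless of the order given to -f
cut_order=$(printf 'a\tb\tc\n' | cut -f 3,1)

echo "cut field 2: [$cut_field]  awk field 2: [$awk_field]"
echo "awk order: [$awk_order]  cut order: [$cut_order]"
```

cut uses Tab as its default delimiter, which is why the last command needs no -d option.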

calculate average insert size

Here is an example awk script that works in conjunction with samtools view to calculate the average insert size for properly paired reads in a BAM file produced by a paired-end alignment:

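samtools view -F 0x4 -f 0x2 yeast_pe.sort.bam | awk '
  BEGIN { FS="\t"; sum=0; nrec=0; }
  # column 9 is the insert size (TLEN); only positive values are used so each pair is counted once
  { if ($9 > 0) { sum += $9; nrec++; } }
  # avoid dividing by zero when no records qualify
  END { if (nrec > 0) print sum/nrec; }'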

process multiple files with a for loop

The general structure of a for loop in bash is shown below. Different portions of the structure can be separated on different lines (like <something> and <something else> below) or put on one line separated with a semicolon ( ; ) like before the do keyword below.

for <variable name> in <expression>; do
  <something>
  <something else>
done

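One common use of for loops is to process multiple files, where the set of files to process is obtained by pathname wildcarding. For example, the code below reports the number of sequences in each .gz file in the current directory, assuming 4 lines per sequence as in FASTQ files:

for fname in *.gz; do
   echo "$fname has $((`zcat $fname | wc -l` / 4)) sequences"
done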

Here fname is the name given the variable that is assigned a different filename each time through the loop. The set of such files is generated by the filename wildcard matching *.gz. The actual file is then referenced as $fname inside the loop.

The bash shell lets you put multiple commands on one line if they are each separated by a semicolon ( ; ). So in the above for loop, you can see that bash considers the do keyword to start a separate command. Two alternate ways of writing the loop are:

# One line for each clause, no semicolons
for <variable name> in <expression>
do
  <something>
  <something else>
done

# All on one line, with semicolons separating clauses
for <variable name> in <expression>; do <something>; <something else>; done
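To try a file-processing loop without real data, the sketch below first creates two tiny made-up compressed files (a.fq.gz with two 4-line records, b.fq.gz with one) and then loops over them. It uses gunzip -c rather than zcat only because gunzip -c behaves the same on Linux and Mac.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Made-up records, 4 lines each, compressed on the fly
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > a.fq.gz
printf '@r1\nACGT\n+\nIIII\n' | gzip > b.fq.gz

report=""
for fname in *.gz; do
  # count the lines in the decompressed data and divide by 4
  nseq=$(( $(gunzip -c "$fname" | wc -l) / 4 ))
  echo "$fname has $nseq sequences"
  # keep a compact summary as well
  report="$report$fname:$nseq "
done
```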

Copying files between TACC and your laptop

Assume you want to copy the TACC file $SCRATCH/core_ngs/fastq_prep/small_fastqc.html back to your laptop/local computer. You must initiate the copy operation from your local computer rather than at TACC. Why? Because the TACC servers have host names and IP addresses that are published in the Internet's Domain Name System (DNS), but your local computer (in nearly all cases) does not have a published name and address.

First, on the TACC server figure out what the appropriate absolute path (a.k.a. full pathname) is.

cd $SCRATCH/core_ngs/fastq_prep
pwd -P

This will return something like /scratch/01063/abattenh/core_ngs/fastq_prep.

For folks with Mac or Linux laptops (or running Windows Subsystem for Linux, or a Command window with scp available):

scp .
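The general shape of an scp download is scp <user>@<host>:<remote absolute path> <local directory>. The sketch below only builds and prints such a command so you can check it before running it. The user name and host name are placeholders, not real TACC values; the path reuses the example path shown above.

```shell
# Placeholders: substitute your own user name and the server's host name
remote_user="your_username"
remote_host="remote.host.example"

# Absolute path of the file on the remote server, as reported by pwd -P
remote_path="/scratch/01063/abattenh/core_ngs/fastq_prep/small_fastqc.html"

# The trailing "." means "copy into the current local directory"
scp_cmd="scp ${remote_user}@${remote_host}:${remote_path} ."

# Print the command rather than running it
echo "$scp_cmd"
```

Once the printed command looks right, run it from a Terminal on your local computer; scp will prompt for your password on the remote server.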

Windows users can use the free WinSCP program:

Using pscp.exe on Windows

You can also use pscp.exe, a remote file copy program that should have been installed with PuTTY.

To use pscp.exe, first open a Command window (Start menu, search for Command). Then in the Command window, see if it is on your Windows %PATH% by just typing the executable name:

pscp.exe

If this shows usage information, you're good to go. Execute something like the following, substituting your user name and absolute path:

cd c:\Scratch
pscp.exe .

If pscp.exe is not on your %PATH%, you may need to locate the program. Try this:

cd "c:\Program Files"\putty

If you see the program pscp.exe, you're good. You just have to use its full path. For example:

cd c:\Scratch
"c:\Program Files"\putty\pscp.exe 

Editing files

There are several options for editing files at TACC. These fall into three categories:

  1. Linux command-line text editors installed at TACC (nano, vi, emacs). These run in your Terminal window. 
  2. Text editors or IDEs that run on your local computer but have an SFTP (secure FTP) interface that lets you connect to a remote computer
  3. Software that will allow you to mount your home directory on TACC as if it were a normal disk

Knowing the basics of at least one Linux text editor is useful for creating small files like TACC commands files. We'll use nano and basic emacs for this in class.

For editing larger files, you may find options #2 or #3 more useful.


nano is a very simple editor available on most Linux systems. If you are able to ssh into a remote system, you can use nano there.

To invoke nano to edit a new or existing file just type:

nano <filename>

You'll see a short menu of operations at the bottom of the terminal window. The most important are:

You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:

Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in commands files.


emacs is a complex, full-featured editor available on most Linux systems.

To invoke emacs to edit a new or existing file just type:

emacs <filename>

Here's a reference sheet that lists many commands. The most important are:

You can just type in text, and navigate around using arrow keys. A couple of other navigation shortcuts:

Be careful when pasting text into an emacs buffer – it takes a few seconds before emacs is ready to accept the full pasted text.

Double-check that the 1st line of pasted text is correct – emacs can clip the 1st few characters if the paste is done too soon.

Line ending nightmares

The dirty little secret of the computer world is that the three main "families" of computers – Macs, Windows and Linux/Unix – use different, mutually incompatible line endings.

And guess what? Most Linux programs don't work with files that have Windows or Mac line endings, and what's worse they give you bizarre error messages that don't give you a clue what's going on!

So whatever non-Linux text editor you use, be sure to adjust its "line endings" setting – and it better have one somewhere!
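If a file has already arrived with Windows line endings, you can check for and remove the extra carriage return characters from the command line. This is a minimal sketch on a made-up two-line file; windows.txt and linux.txt are hypothetical names.

```shell
demo_dir=$(mktemp -d)

# Write a two-line file with Windows (carriage return + line feed) endings
printf 'first\r\nsecond\r\n' > "$demo_dir/windows.txt"

# Count the lines that end with a carriage return
cr=$(printf '\r')
crlf_before=$(grep -c "$cr\$" "$demo_dir/windows.txt" || true)

# Delete every carriage return, leaving Linux (line feed only) endings
tr -d '\r' < "$demo_dir/windows.txt" > "$demo_dir/linux.txt"

# grep -c exits non-zero when the count is 0, hence the "|| true"
crlf_after=$(grep -c "$cr\$" "$demo_dir/linux.txt" || true)

echo "lines ending in CR before: $crlf_before, after: $crlf_after"
```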

Komodo Edit for Mac and Windows

Komodo Edit is a free, full-featured text editor with syntax coloring for many programming languages and a remote file editing interface. It has versions for both Macintosh and Windows. Download the appropriate install image here.

Once installed, start Komodo Edit and follow these steps to configure it:

When you want to open an existing file at stampede2, do the following:

To create and save a new file, do the following:

Rather than having to navigate around TACC's complex file system tree, it helps to use the symbolic links to those areas that we created in your home directory.

Notepad++ for Windows

Notepad++ is an open source, full-featured text editor for Windows PCs (not Macs). It has syntax coloring for many programming languages (python, perl, bash), and a remote file editing interface.

If you're on a Windows PC download the installer here.

Once it has been installed, start Notepad++ and follow these steps to configure it:

To open the connection, click the blue (Dis)connect icon then select your stampede2 connection. It should prompt for your password. Once you've authenticated, a directory tree ending in your home directory will be visible in the NppFTP window. You can click the (Dis)connect icon again to Disconnect when you're done.

Rather than having to navigate around TACC's complex file system tree, it helps to use the symbolic links to those areas that we created in your Home directory (~/work2 or ~/scratch).