Linux refresher

This is more a cram-session on Linux, like learning to drive if you had to drive your younger sister to the hospital when you're 10 because you just stuck a knife in her leg accidentally and didn't want your parents to know.

All the Exercises listed below will work on Lonestar.

Getting to a remote computer

ssh is an executable program that runs on your local computer. It connects securely (i.e. encrypted) to a remote computer that you specify. ssh programs exist for all major operating systems - Windows, Mac, and linux. Mac and linux come with these commands built-in; Windows needs some help. If you're using a Windows box and are part of UT Austin, Bevoware provides two free ssh programs, "ssh secure client" and "putty". We won't describe how to use these here.

If you're on a linux box or a mac, open a terminal window and log in to lonestar with your TACC login credentials using ssh:

SSH to access Lonestar at TACC

ssh <your user ID>@lonestar.tacc.utexas.edu

When you log in to a linux computer, the operating system checks your login credentials and if they're OK it sets up some configuration for you and then runs a program called a "shell" which acts like your fast-food drive-thru window to the rest of the operating system. You type commands and hit "enter" to send something into the drive-thru window, and then the OS passes output back through the drive-thru window.

Every time you exchange stuff through this window, it's within a context, like one specific drive-thru window at one restaurant. The directory within the file system is one part of that context; the programs and environment variables available to you are other parts of that context. When you log in, the system and shell agree that you'll start off in your home directory on the system.

Essential command-line tricks to look like an expert quickly, or figure out what's going on.

Type as little and as accurately as possible by cheating:

Cheat 1: Use "up arrow" to retrieve any of the last 500 commands you've typed. You can then edit them and hit enter (even in the middle of the command) and the shell will use that command.

This taps into a feature of the shell - your history. The command history will print to the screen the last 500 commands you've typed. You can modify this number if you'd like. VERY USEFUL TIP - every so often do history >> what which will append your history to a file called "what". I leave these lying around in directories so I can remember what I was doing, how I generated output data, etc. These can often become the basis for a shell script (we'll get to those). Advanced topic: use history to be super-fast at the command line.
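A minimal sketch of those history tricks, assuming the bash shell at an interactive prompt (inside a script, history is normally switched off, so the first two lines produce nothing useful). The number 2000 is only an illustration.

```shell
# Show only the five most recent commands instead of the whole list.
history | tail -n 5

# Append (>>) the full history to the file "what"; a single > would overwrite it.
history >> what

# Remember more commands for this session (bash's default is 500).
# Putting the same line in ~/.bashrc makes it stick for future logins.
HISTSIZE=2000
```

Because >> appends, repeating the tip in the same directory keeps adding to the same "what" file rather than replacing it.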

Cheat 2: Hit the tab key twice - it's almost always magic. This instructs the shell to try to guess what you're doing and finish the typing for you. On most modern linux shells, it works for commands (like "ls" or "scp") and for completing file or directory names.

This is really useful if you can't remember whether fasta2fastq.sh is fastaToFastq or fastaToFastq.sh or Fasta2fastq or Fasta2Fastq.sh or something else. It's also helpful for reconstructing directory paths or filenames on-the-fly.

You might write out a long command with a ton of options in the terminal and then find out that you misspelled something at the very beginning of the line. It can be really annoying to hold down the arrow key to get back to that point.

Cheat 3: You can use control-a (holding down the "control" key and "a") to jump the cursor right to the beginning of the line. The omega to that alpha is control-e, which jumps the cursor to the end of the line. Arrow keys work, and <Ctrl>-arrow will skip by word forward and backward.

Unfortunately, you are pretty much out of luck if you want to jump to the middle of the line. In this case you might want to copy the whole command into a nice text editor on your desktop, change it, and copy it back. Advanced topic: command line editors.

Exercise

Type "modu" then hit tab twice - it presents two choices, module and modutil. Type the next character l, hit tab twice and it will complete the rest of the typing. If you hit tab twice again, the OS will show you all the files in your current working directory which doesn't make any sense for the command "module" - it's smart, but not smart enough to figure out that the next word in the command needs to be one of module's built-in commands.

Inline help

Man pages - linux has had built-in help files since the mid-1500's, way before Macs or PCs thought of such things. In linux they're called man pages - short for "manual"; it's not a gender thing (I assume). man intro will give you an introduction to all user commands.

Exercise:

Try "man grep", or "man du", or "man sort" - you'll want these sometime.

Tip: Type the letter q to quit man, j or <CR> to move down a line and k to move up a line, spacebar or b to move down or up a page. Want to search? Just hit the slash key /, enter the search word and hit enter. These are actually the tools of the less command which man is using.

Basic linux commands you need to know like breathing air

ls - list the contents of the current directory
pwd - print the present working directory - which restaurant am I at right now - the format is something like /home/myID - just like on most computer systems, this represents leaves on the tree of the file system structure, also called a "path".
cd <whereto> - change the present working directory to <whereto> - pick up my drive-thru window (shell) and move it so that I'm now looking thru to the directory <whereto>
- Some special <wheretos>: .. (period, period) means "up one level". ~ (tilde) means "my home directory". ~myfriend (tilde, then "myfriend") means "myfriend's home directory".
df shows you the file systems mounted on the system you're working on, along with how much disk space is available on each
head <file> and tail <file> show you the top or bottom 10 lines of a file <file>
more <file> and less <file> both display the contents of <file> in nice ways. Read the bit above about man to figure out how to navigate and search when using less
file <file> tells you what kind of file <file> is.
cat <file> outputs all the contents of <file> - CAUTION - only use on small files.
rm <file> deletes a file. This is permanent - not a "trash can" deletion.
cp <source> <destination> copies the file source to the location and/or file name destination. Using . (period) as the destination means "here, with the same name".
- cp -r <dirname> <destination> will recursively copy the directory dirname and all its contents to the directory destination.
scp <user>@<host>:<source> <destination> works just like cp but copies source from the user user's directory on remote machine host to the local file destination
mkdir <dirname> and rmdir <dirname> make and remove the directory "dirname". rmdir only removes empty directories - "rm -r <dirname>" will remove a directory and everything in it.
wget <url> fetches a file with a valid URL. It's not that common but we'll use wget to pull data from one of TACC's web-based storage devices.
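To see several of these commands working together, here is a short sketch that plays in a throwaway directory. The directory and file names (practice, numbers.txt and so on) are made up for illustration, and mktemp is an extra command not in the list above.

```shell
# Work in a throwaway directory so nothing important is touched.
# mktemp -d (an addition here) makes a new empty directory and prints its name.
scratch=$(mktemp -d)
cd "$scratch"
pwd                           # where am I now?

mkdir practice                # make a directory
cd practice                   # move into it
# Make a 12-line file; the > sends printf's output into the file.
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > numbers.txt

ls                            # lists numbers.txt
head numbers.txt              # first 10 lines
tail -n 2 numbers.txt         # last 2 lines only
cp numbers.txt copy.txt       # copy under a new name
rm numbers.txt                # gone for good, no trash can

cd ..                         # back up one level
cp -r practice practice2      # copy the directory and its contents
rm -r practice                # remove a directory that still has files in it
rm practice2/copy.txt         # empty out the copy by hand
rmdir practice2               # rmdir works now because practice2 is empty
```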

Exercises:

Use variables to store where you are, move away, and then back. Try this and see if you can figure out what the shell is doing for you:

Practice some linux basics

pwd
here=`pwd`
cd /scratch/01057
pwd
cd $here
pwd
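If you want to check your answer away from Lonestar, the sketch below repeats the same idea with two throwaway directories, so it doesn't depend on /scratch/01057 existing. The names start and away, the use of mktemp, and the quotes around the variables are additions for illustration.

```shell
start=`mktemp -d`      # two hypothetical stand-in directories
away=`mktemp -d`

cd "$start"
here=`pwd`             # run pwd and store its output in the variable "here"
echo "$here"           # print what was stored

cd "$away"             # move somewhere else
pwd

cd "$here"             # the shell swaps in the saved path before running cd
pwd                    # back where we started
```

The quotes around "$here" are a habit worth picking up: they keep the path in one piece even if a directory name contains a space.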

Scavenger hunt practice; on Lonestar issue the following commands:

Play a scavenger hunt for more practice

cp -r /corral-repl/utexas/BioITeam/linuxpractice .
cd linuxpractice
cd what
cat readme

and follow the instructions. Hints: use <tab><tab> to fill in filenames as much as you can.

Options: the lifeblood of linux commands

Sitting at the computer, you should have some idea what you need to do. There's probably a command to do it. If you have some idea what it starts with, you can type a few characters and hit tab twice to get some help. If you have no idea, you google it or ask someone else. But soon you want those commands to do a bit more - like seeing the sizes of files in addition to their names.

Most commands in linux use a common syntax to ask more of a command; they usually add a dash "-" followed by a code letter that means "do the basic command, but with a bit more..."

Useful options for ls

ls -l
ls -lh
ls -t

These little toggle-like things are often called "command line switches"; there can be other options, like filenames, that aren't switches.

Almost all commands, and especially NGS tools, use options heavily.
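For reference, here is roughly what the three ls commands above ask for. This is a sketch assuming the ls found on most linux systems; the exact columns can differ a little from system to system.

```shell
ls -l      # long listing: permissions, owner, size in bytes, date, name
ls -lh     # same, but sizes are human-readable (K, M, G)
ls -t      # sort by modification time, newest first
```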

Like dialects in a language, there are at least three basic schemes commands/programs accept options in:

One letter options which can sometimes be combined, or other single options like:
Examples of different option types
```
head -10
ls -lhtS    # equivalent to ls -l -h -t -S
```

Word options, like -d64 and -Xms512m in this command, that are never combined (this is the GATK command to call SNPs):

Examples of word options

java -d64 -Xms512m -Xmx4g -jar /work/01866/phr254/gshare/Tools_And_Programs/bin/GenomeAnalysisTK.jar -glm BOTH -R $reference -T UnifiedGenotyper -I \
$outprefix.realigned.recal.bam --dbsnp $dbsnp -o $outprefix.snps.vcf -metrics snps.metrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 \
-A DepthOfCoverage -A AlleleBalance

"Long option" forms, using the convention that a single dash - precedes single-letter options, and double dashes -- precede word options, like this command to run the mira assembler:
Example of long options
```
mira --project=ct --job=denovo,genome,accurate,454 -SK:not=8
```

man pages should detail all options available for a command. Unless there's no man page.

More help please

Sometimes man lets you down - no man page. Don't fret, try one of these:

Just type in the command and hit return - it will usually try to help you.
Type the command followed by one of: -h -help --help -? and it may give you some help.
Sometimes the command by itself will give you short help, and will list the magic option for full help.
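As a small illustration of the second trick, assuming GNU-style tools (which is what most linux systems ship, though not every program follows the convention):

```shell
# Many tools print a usage summary for --help; head keeps just the first lines.
# 2>&1 folds error output into normal output, since some tools print
# their help as an error message instead.
ls --help 2>&1 | head -n 5
grep --help 2>&1 | head -n 5
```

Piping through head is just a courtesy to your screen; drop it, or pipe to less instead, when you want to read the whole thing.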

Exercise

First do:

module load blast

Now figure out how to run some kind of blast program on lonestar with options. Hints: try <tab><tab>, man, running some blast command, use options to figure out other options.

I've put nr, nt, and refseq_rna blast databases on Lonestar here:
/corral-repl/utexas/BioITeam/blastdb/
along with a test sequence: the human JAG1 gene, here:
/corral-repl/utexas/BioITeam/sphsmith/jag1.fa

Hint/solution

blastn -query jag1.fa -db /corral-repl/utexas/BioITeam/blastdb/nt -evalue 1e-100
But of course you wouldn't run this on the head node - you'd instead enter it into a file called "commands" (using a text editor, or echo as shown here) and then submit it as a job:

echo "blastn -query jag1.fa -db /corral-repl/utexas/BioITeam/blastdb/nt -evalue 1e-100" > commands
/corral-repl/utexas/BioITeam/ngs_course/launcher_creator.py -l blast.sge -n blast_jag1 -t 00:30:00 -j commands
qsub blast.sge

Editing files

Editing text files is very common; you need to find an editor you like and get proficient at it.

Like all things Linux, there is a trade-off between power and learning curve. The two most powerful editors on linux systems are emacs and vi (these links take you to online manuals/tutorials); they are virtually programming languages themselves, making them both powerful and nontrivial to learn.

Two very easy-to-use editors are nano and pico; nano exists on Lonestar.

Editing a text file with nano

nano commands

Try now to create the file commands using a text editor and enter or edit the blast command from the previous exercise, changing the output from full blast alignments to the more parse-able format given by the option -outfmt 6. Run it if you have time.

Piping and redirection

It's a pain to have to order for your kids at the drive-thru; sometimes you'd like them to order directly and have the food go directly to them instead of through you. In a linux shell, this is called redirection. It uses a familiar metaphor: "pipes".

The linux operating system expects some "standard input pipe" and gives output back through a "standard output pipe". These are called "stdin" and "stdout" in linux. There's also a special "stderr" for errors; we'll ignore that for now.

Usually, your shell is filling the operating system's stdin with stuff you type - the commands with options. The shell passes responses back from those commands to stdout, which the shell usually dumps to your screen.

The ability to switch stdin and stdout around is one of the key reasons linux has existed for decades and beat out many other operating systems.

The syntax for doing this switching around can be confusing because it uses codes.

Exercise

Redirect stdout of the ls -1 command to the file whatsHere

Redirecting STDOUT

ls -1 > whatsHere
cat whatsHere

Redirect stdout of ls -1 to the head -2 command, then to the file whatsHere

Piping one command's output to another, and then redirecting STDOUT to a file

ls -1 | head -2 > whatsHere
cat whatsHere

So: the redirect output (stdout) character is >, and the "pass output along as input" is the pipe character |.
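Here is a sketch of those two characters plus two relatives, >> (append) and < (read stdin from a file), which are standard shell syntax but not part of the exercise. It runs in a throwaway directory with three made-up file names.

```shell
cd "$(mktemp -d)"              # throwaway directory (an assumption, not from the exercise)
touch a.txt b.txt c.txt        # three empty files to list

ls -1 > whatsHere              # > creates or overwrites the file
ls -1 >> whatsHere             # >> appends to the end instead
ls -1 | wc -l                  # | feeds ls's output to wc: 4 names
wc -l < whatsHere              # < makes the file wc's stdin: 8 lines
```

The count is 4 rather than 3 because the shell creates whatsHere before ls runs, so the file shows up in its own listing.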

Redirect stdout of ls -1 to the head -2 command, then to the file whatsHere, then use that file to list the sizes of those two files

Piping, redirection, and command substitution

ls -1 | head -2 > whatsHere
cat whatsHere
ls -l `cat whatsHere`

(You may have to copy and paste that last command - it uses the backtick character ` to tell the shell to execute the command cat whatsHere and then use the result as an option to the ls -l command).
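If the backtick is hard to find on your keyboard, bash and most modern shells also accept $( ) for the same job. A sketch under that assumption, using two made-up files in a throwaway directory:

```shell
cd "$(mktemp -d)"
printf 'hello\n' > first.txt
printf 'hello again\n' > second.txt

ls -1 | head -2 > whatsHere        # names of the first two entries
cat whatsHere                      # first.txt and second.txt
ls -l $(cat whatsHere)             # same as: ls -l `cat whatsHere`
```

Either form breaks file names apart at spaces, so this trick is best kept for files with simple names.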

Perfect - now my kids (i.e. other commands) can order (stdin) through the same drive-thru window and get their kid-pack (stdout) directly!

Not all shells are equal - the bash shell lets you redirect stdout with either > or 1>; stderr can be redirected with 2>; you can redirect both stdout and stderr using &>. If these don't work, use google to try to figure it out. The web site stackoverflow is a usually trustworthy and well annotated site for OS and shell help.
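A quick way to see stdout and stderr going separate ways is to ask ls for a file that isn't there. This is a sketch in a throwaway directory; no_such_file and the output file names are made up.

```shell
cd "$(mktemp -d)"

# Listing a file that does not exist produces an error message on stderr.
ls no_such_file > out.txt 2> err.txt     # stdout to out.txt, stderr to err.txt
cat err.txt                              # the complaint from ls
cat out.txt                              # empty: nothing went to stdout

# Send both streams to one file. "&> both.txt" is the bash shorthand;
# "> both.txt 2>&1" means the same and works in other shells too.
ls . no_such_file > both.txt 2>&1
cat both.txt
```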

Confused about commands vs. programs?

A command is something entered to stdin that the OS understands how to run. That command might be a list of other commands, a command built in to the shell, or an executable program (sometimes called a binary).

Command is the general term, but, for example, when we run samtools we're entering the command samtools to run the executable binary program samtools.

The operating system must have previously been told where the actual executable file called samtools exists in the filesystem, so when a user enters the command samtools it knows where to find the executable samtools and run it.
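The list of places to look is kept in the environment variable PATH. A small sketch, using ls as a stand-in since samtools may not be available in your session yet:

```shell
echo "$PATH"        # colon-separated list of directories searched for programs
command -v ls       # the executable that typing "ls" would run
type cd             # cd is built in to the shell, so there is no file to find
```

If command -v prints nothing for a program, the shell doesn't know where it lives yet.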
