...

  • wc -l  reports the number of lines (-l) in its input
  • history lists your command history to the terminal
    • redirect to a file to save a history of the commands executed in a shell session
    • pipe to grep to search for a particular command
  • which <pgm> searches all $PATH directories to find <pgm> and reports its full pathname
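As a small illustration of the commands above, the sketch below counts lines with wc -l and searches saved commands with grep. The file is a made-up stand-in for redirected history output (history itself is only meaningful in an interactive shell), and the variable names are assumptions for illustration.

```shell
# Hypothetical stand-in for "history > my_history.txt" (made-up commands)
hist_file=$(mktemp)
printf '%s\n' 'cd $SCRATCH' 'ls -l' 'grep -c ACGT reads.txt' 'ls' > "$hist_file"

# wc -l reports the number of lines in its input
n_lines=$(wc -l < "$hist_file")
echo "commands saved: $n_lines"

# pipe grep output to wc -l to count the commands containing the text "ls"
n_ls=$(grep 'ls' "$hist_file" | wc -l)
echo "commands containing ls: $n_ls"

rm -f "$hist_file"
```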

Copying files from TACC to your laptop

Assume you want to copy the TACC file $SCRATCH/core_ngs/fastq_prep/small_fastqc.html back to your laptop.

First, figure out what the appropriate absolute path (a.k.a. full pathname) is on TACC.

Code Block
languagebash
titleExecute this at TACC
cd $SCRATCH/core_ngs/fastq_prep
pwd -P

This will return something like /scratch/01063/abattenh/core_ngs/fastq_prep.
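The -P option asks pwd for the physical path, with any symbolic links resolved, which is the form you want when building a remote copy path. The sketch below shows the difference using a temporary directory; the directory names real_dir and link_dir are hypothetical.

```shell
# Hypothetical directory names under a temporary location
tmp=$(mktemp -d)
mkdir "$tmp/real_dir"
ln -s "$tmp/real_dir" "$tmp/link_dir"

# pwd -L reports the logical path (the one used to get there, via the link)
logical=$(cd "$tmp/link_dir" && pwd -L)
# pwd -P resolves symbolic links and reports the physical path
physical=$(cd "$tmp/link_dir" && pwd -P)

echo "logical:  $logical"
echo "physical: $physical"

rm -rf "$tmp"
```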

For folks with Mac or Linux laptops, just open a terminal window, cd to the directory where you want the files, and type something like the following, substituting your user name and absolute path:

Code Block
languagebash
titleExecute this on your laptop
scp abattenh@ls5.tacc.utexas.edu:/scratch/01063/abattenh/core_ngs/fastq_prep/small_fastqc.html .

For Windows users, pscp.exe, a remote file copy program, should have been installed with PuTTY. To use it, first open a Command window (Start menu, search for Cmd). Then, in the Command window, check whether it is on your Windows %PATH% by typing the executable name:

Code Block
pscp.exe

If this shows usage information, you're good to go. Execute something like the following, substituting your user name and absolute path:

Code Block
cd c:\Scratch
pscp.exe abattenh@ls5.tacc.utexas.edu:/scratch/01063/abattenh/core_ngs/fastq_prep/small_fastqc.html .

If pscp.exe is not on your %PATH%, you may need to locate the program. Try this:

Code Block
cd "c:\Program Files"\putty
dir

If you see the program pscp.exe, you're good. You just have to use its full path. For example:

Code Block
cd c:\Scratch
"c:\Program Files"\putty\pscp.exe abattenh@ls5.tacc.utexas.edu:/scratch/01063/abattenh/core_ngs/fastq_prep/small_fastqc.html .

Advanced commands

cut, sort, uniq, grep, awk

  • cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
    • -d <delim> to change the field delimiter (tab by default)
  • sort sorts its input using an efficient algorithm
    • by default sorts each line lexically, but one or more fields to sort can be specified with one or more -k <field_number> options
    • options to sort numerically (-n), or numbers-inside-text (version sort -V)
  • uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
  • grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
    • always enclose <pattern> in single quotes to inhibit shell evaluation!
    • -P says use Perl patterns, which are much more powerful than standard grep patterns
    • -c says just return a count of line matches
    • -n says include the line number of the matching line
    • -v (inverse match) says return only lines not matching the pattern
    • -L says return only the names of files containing no pattern matches
    • <pattern> can contain special match meta-characters and modifiers such as:
      • ^ – matches beginning of line
      • $ – matches end of line
      • .  – (period) matches any single character
      • \s – matches any whitespace (\S any non-whitespace)
      • \d – matches digits 0-9
      • \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
      • \t matches Tab; \r matches Carriage return; \n matches Linefeed
      • [xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
        • this is called a character class.
        • use [^xyz123] to match any single character not listed in the class
      • (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
        • note that parentheses ( ) may also be used to capture matched sub-expressions for later use
      • * – modifier; place after an expression to match 0 or more occurrences
      • + – modifier; place after an expression to match 1 or more occurrences
    • Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
      • each "flavor" is slightly different
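      • even the shell has several grep variants: grep (basic regular expressions), egrep (extended regular expressions) and fgrep (fixed strings, no pattern matching).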
    • There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
  • awk '<script>' is a powerful scripting language that is easily invoked from the command line
    • <script> is applied to each line of input (generally piped in)
      • always enclose <script> in single quotes to inhibit shell evaluation
    • General structure of an awk script:
      • BEGIN {<expressions>}  –  use to initialize variables before any script body lines are executed
        • e.g. BEGIN {FS="\t"; OFS="\t"; sum=0} says
          • use tab (\t) as the input (FS) and output (OFS) field separator (by default, input fields are split on whitespace and output fields are joined with a single space), and
          • initialize the variable sum to 0.
      • {<body expressions>}  – expressions to apply to each line of input
        • use $1, $2, etc. to pick out specific input fields
        • e.g. {print $3,$4} outputs fields 3 and 4 of the input
      • END {<expressions>} – executed after all input is complete (e.g. print a sum)
    • Here is an excellent awk tutorial, very detailed and in-depth
      •  take a look once you feel comfortable with the example scripts we've gone over in class.
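The commands above can be tried on a small tab-delimited file. The sketch below is a minimal illustration, not part of the course data: the file contents (sample name, chromosome, read count) and variable names are made up, and grep -P assumes a grep built with Perl pattern support, such as GNU grep on Linux.

```shell
# Made-up tab-delimited data: sample name, chromosome, read count
data=$(mktemp)
printf 'sampleA\tchr1\t10\nsampleB\tchr2\t25\nsampleC\tchr1\t5\nsampleD\tchr10\t40\n' > "$data"

# cut field 2, sort it, then count each distinct value with uniq -c
cut -f 2 "$data" | sort | uniq -c

# sort numerically (n), in reverse (r), on field 3; keep the top sample name
top=$(sort -k3,3nr "$data" | head -n 1 | cut -f 1)
echo "sample with most reads: $top"

# grep -P: count lines whose 2nd field is exactly chr1 (chr10 must not match)
n_chr1=$(grep -c -P '\tchr1\t' "$data")
echo "chr1 lines: $n_chr1"

# awk: sum field 3 over all lines and print the total in the END block
total=$(awk 'BEGIN{ FS="\t"; sum=0 } { sum += $3 } END{ print sum }' "$data")
echo "total reads: $total"

rm -f "$data"
```

Requiring a tab on both sides of chr1 in the grep pattern is what keeps chr10 from matching.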

calculate average insert size

Here is an example awk script that works in conjunction with samtools view to calculate the average insert size for properly paired reads in a BAM file produced by a paired-end alignment:

Code Block
languagebash
titleCalculating average insert size
samtools view -F 0x4 -f 0x2 yeast_pe.sort.bam | awk '
  BEGIN{ FS="\t"; sum=0; nrec=0; }
  { if ($9 > 0) {sum += $9; nrec++;} }
  END{ print sum/nrec; }'
  • samtools view converts each alignment record in yeast_pe.sort.bam to text
    • the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
      • BAM files often contain records for both mapped and unmapped reads
      • -F filters out records where any of the specified bit(s) are set (i.e., they are 1)
        • so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
    • the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
      • these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
      • -f filters out records where the specified bit(s) are not set (i.e., they are 0)
    • alignment records that pass both filters are written to standard output
  • | awk '
    • the pipe | connects the standard output of samtools view to the standard input of awk
    • the single quote denotes the start of the awk script
      • we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
  • 'BEGIN{ ... }{...}END{...}'
    • these 3 lines of text, enclosed in single quotes, are the awk script
    • the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
      • it says to use Tab ("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
    • { if ($9 > 0) {sum += $9; nrec++;} }
      • this is the body of the awk script, which is executed for each line of input
      • $9 represents the 9th tab-delimited field of the input
        • the 9th field of an alignment record is the insert size (TLEN, the observed template length), according to the SAM format spec
      • we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
        • since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
      • when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
    •  END{ print sum/nrec; }
      • the END block between the curly brackets { } is executed once after the script has processed all input data
      • this prints the average insert size (sum/nrec) to standard output
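To try the awk script without a BAM file or samtools, the sketch below feeds it a few made-up, tab-delimited records laid out like the text that samtools view produces (11 columns, with the insert size in column 9). The record values are invented for illustration, and the nrec > 0 check in the END block is an added safeguard against dividing by zero, not part of the script above.

```shell
# Made-up alignment records: two proper pairs with insert sizes 250 and 350
sam=$(mktemp)
{
  printf 'r1\t99\tchrI\t100\t60\t4M\t=\t346\t250\tACGT\tFFFF\n'
  printf 'r1\t147\tchrI\t346\t60\t4M\t=\t100\t-250\tACGT\tFFFF\n'
  printf 'r2\t99\tchrI\t500\t60\t4M\t=\t846\t350\tACGT\tFFFF\n'
  printf 'r2\t147\tchrI\t846\t60\t4M\t=\t500\t-350\tACGT\tFFFF\n'
} > "$sam"

# Same script as above, reading the file directly instead of a pipe;
# only records with a positive 9th field contribute to the average
avg=$(awk '
  BEGIN{ FS="\t"; sum=0; nrec=0; }
  { if ($9 > 0) {sum += $9; nrec++;} }
  END{ if (nrec > 0) print sum/nrec; else print "no records"; }' "$sam")
echo "average insert size: $avg"

rm -f "$sam"
```

Only the two records with positive insert sizes are counted, so each pair contributes once.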

process multiple files with a for loop

The general structure of a for loop in bash is shown below. Different portions of the structure can be separated on different lines (like <something> and <something else> below) or put on one line separated with a semicolon ( ; ) like before the do keyword below.

Code Block
languagebash
for <variable name> in <expression>; do 
  <something>
  <something else>
done

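One common use of for loops is to process multiple files, where the set of files to process is obtained by pathname wildcarding. For example, the code below counts the sequences in each compressed FASTQ file in the current directory (a FASTQ record is 4 lines long):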

Code Block
languagebash
titleFor loop to count sequences in multiple FASTQs
for fname in *.gz; do
   echo "$fname has $(( $(zcat "$fname" | wc -l) / 4 )) sequences"
done

Here fname is the name given to the variable that is assigned a different filename each time through the loop. The set of such files is generated by the filename wildcard matching *.gz. The actual file is then referenced as $fname inside the loop.
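A self-contained way to experiment with this loop is sketched below. It creates two small made-up FASTQ files in a temporary directory (the file names and reads are hypothetical) and uses gzip -dc, which does the same job as zcat for .gz files, so that it also works on systems where zcat behaves differently.

```shell
# Made-up FASTQ files in a temporary directory (file names are hypothetical)
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFFFF\n' | gzip > "$workdir/sample1.fastq.gz"
printf '@r1\nGGCA\n+\nFFFF\n' | gzip > "$workdir/sample2.fastq.gz"

report=""
for fname in "$workdir"/*.gz; do
  # each FASTQ record is 4 lines, so sequences = lines / 4
  n_seq=$(( $(gzip -dc "$fname" | wc -l) / 4 ))
  line="$(basename "$fname") has $n_seq sequences"
  echo "$line"
  report="$report$line;"   # collect the lines so they can be checked later
done

rm -rf "$workdir"
```

Quoting "$fname" keeps the loop working even when a file name contains spaces.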

...

Editing files

...