Page History

...

Macs and Linux have a Terminal program built-in
Windows options:
- Windows 10+
  - Command Prompt program and PowerShell have ssh and scp (may require latest Windows updates)
    - Start menu → Search for Command
  - Windows Subsystem for Linux – Windows 10 Professional includes a Ubuntu-like bash shells
    - See https://docs.microsoft.com/en-us/windows/wsl/install-win10
    - We recommend the Ubuntu Linux distribution, but any Linux distribution will have an SSH client
- or
  - Putty – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
    - simple Terminal and file copy programs
    - download either the Putty installer (https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.7077-installer.msi)
      - or just putty.exe (terminal) and pscp.exe (secure copy client)
  - Cygwin – http://www.cygwin.com/
    - A full Linux environment, including X-windows for running GUI programs remotely
    - Complicated to install

...

ls *.bam – lists all files in the current directory that end in .bam
ls [A-Z]*.bam – does the same, but only if the first character of the file is a capital letter
ls [ABcd]*.bam – lists all .bam files whose 1st letter is A, B, c or d.
ls *.{fastq,fq}.gz – lists all .fastq.gz and .fq.gz files.

...

samtools view converts the binary small.bam file to text and writes alignment record lines one at a time to standard output.
- -F 0x4 option says to filter out any records where the 0x4 flag bit is 0 (not set)
- since the 0x4 flag bit is set (1) for unmapped records, this says to only report records where the query sequence did map to the reference
| head -1000
- the pipe connects the standard output of samtools view to the standard input of head
- the -1000 option says to only write the first 1000 lines of input to standard output
| cut -f 5
- the pipe connects the standard output of head to the standard input of cut
- the -f 5 option says to only write the 5th field of each input line to standard output (input fields are tab-delimited by default)
  - the 5th field of an alignment record is an integer representing the alignment mapping quality
  - the resulting output will have one integer per line (and 1000 lines)
| sort -n
- the pipe connects the standard output of cut to the standard input of sort
- the -n option says to sort input lines according to numeric sort order
- the resulting output will be 1000 numeric values, one per line, sorted from lowest to highest
| uniq -c
- the pipe connects the standard output of sort to the standard input of uniq
- the -c option option says to just count groups of lines with the same value (that's why they must be sorted) and report the total for each group
- the resulting output will be one line for each group that uniq sees
- each line will have the text for the group (here the unique mapping quality values) and a count of lines in each group

...

There are three types of quoting in the shell:

single quoting (e.g. 'some text') – this serves two purposes
- It groups together all text inside the quotes into a single argument that is passed to the command.
- It tells the shell not to "look inside" the quotes to perform any evaluations.
  - Anything that looks like an environment variable or a bash meta-character is not evaluated.
  - No pathname globbing (e.g. *) is performed.
double quoting (e.g. "some text") – also serves two purposes
- It groups together all text inside the quotes into a single argument that is passed to the command.
- It allows environment variable evaluation (but inhibits pathname globbing).
backtick quoting (e.g. `date`)
- It evaluates the expression inside the backticks.
- The resulting standard output of the expression replaces the backticked text.

...

cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
- -d <delim> to change the field delimiter (Tab by default)
sort sorts its input using an efficient algorithm
- by default sorts each line lexically
  - one or more fields to sort can be specified with one or more -k <start_field_number>,<end_field_number> options
- has options to sort numerically (-n), or numbers-inside-text (version sort -V)
- -t <delim> to change the field delimiter (whitespace -- one or more spaces or Tabs – by default)
uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
use cut | sort | uniq -c for a quick-and-dirty histogram (see piping a histogram)
Anchor
GREP
GREP
grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
always enclose <pattern> in single quotes to inhibit shell evaluation!
-P says use Perl patterns, which are much more powerful than standard grep patterns
-c says just return a count of line matches
-n says include the line number of the matching line
-v (inverse match) says return only lines not matching the pattern
-L says return only the names of files containing no pattern matches
-l says return only the names of files that do contain the mattern match
<pattern> can contain special match meta-characters and modifiers such as:
^ – matches beginning of line
$ – matches end of line
. – (period) matches any single character
* – modifier; place after an expression to match 0 or more occurrences
+ – modifier, place after an expression to match 1 or more occurrences
\s – matches any whitespace (\S any non-whitespace)
\d – matches digits 0-9
\w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
\t matches Tab
\r matches Carriage return
\n matches Linefeed
[xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
this is called a character class.
use [^xyz123] to match any single character not listed in the class
(Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
note that parentheses ( ) may also be used to capture matched sub-expressions for later use
Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
each "flavor" is slightly different
even bash has multiple regex commands: grep, egrep, fgrep.
There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
here's a good general one: https://www.regular-expressions.info/
and a perl regex tutorial: http://perldoc.perl.org/perlretut.html
perl regular expressions are the "gold standard" used in most other languages
Anchor
AWK_script
AWK_script
awk '<script>' a powerful scripting language that is easily invoked from the command line
<script> is applied to each line of input (generally piped in)
always enclose <script> in single quotes to inhibit shell evaluation
General structure of an awk script:
BEGIN{<expressions>} – use to initialize variables before any script body lines are executed
e.g. BEGIN{FS=":"; OFS="\t"; sum=0} says
use colon (:) as the input field separator (FS), and Tab (\t) as the output field separator (OFS)
the default input field separator separator (FS) is whitespace
one or more spaces or tabs
the default output field separator (OFS) is a single space
initialize the variable sum to 0
{<body expressions>} – expressions to apply to each line of input
use $1, $2, etc. to pick out specific input fields
e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
END{<expressions>} – executed after all input is complete (e.g. print a sum)
Here is an excellent awk tutorial, very detailed and in-depth
take a look once you feel comfortable with the example scripts we've gone over in class.

...

samtools view converts each alignment record in yeast_pairedend.sort.bam to text
the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
BAM files often contain records for both mapped and unmapped reads
-F filters out records where the specified bit(s) are not set (i.e., they are 0)
so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
-f filters out records where the specified bit(s) are set (i.e., they are 1)
alignment records that pass both filters are written to standard output
| awk
the pipe | connects the standard output of samtools view to the standard input of awk
the single quote denotes the start of the awk script
we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
'BEGIN{ ... }{...}END{...}'
these 3 lines of text, enclosed in single quotes, are the awk script
the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
it says to use Tab("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
{ if ($9 > 0) {sum += $9; nrec++} }
this is the body of the awk script, which is executed for each line of input
$9 represents the 9th tab-delimited field of the input
the 9th field of an alignment record is the insert size, according to the SAM format spec
we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
END{ print sum/nrec; }
the END block between the curly brackets { } is executed once after the script has processed all input data
this prints the average insert size (sum/nrec) to standard output

process multiple files with a for loop

...

Page tree

Versions Compared

Old Version 88

New Version 89

Key

process multiple files with a for loop