...
- macOS and Linux have a Terminal program built in
- Windows options:
- Windows 10+
- Command Prompt program and PowerShell have ssh and scp (may require latest Windows updates)
- Start menu → Search for Command
- Windows Subsystem for Linux – Windows 10 Professional includes an Ubuntu-like bash shell
- See https://docs.microsoft.com/en-us/windows/wsl/install-win10
- We recommend the Ubuntu Linux distribution, but any Linux distribution will have an SSH client
- or
- Putty – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
- simple Terminal and file copy programs
- download either the Putty installer (https://the.earth.li/~sgtatham/putty/latest/w64/putty-64bit-0.7077-installer.msi)
- or just putty.exe (terminal) and pscp.exe (secure copy client)
- Cygwin – http://www.cygwin.com/
- A full Linux environment, including X-windows for running GUI programs remotely
- Complicated to install
...
- ls *.bam – lists all files in the current directory that end in .bam
- ls [A-Z]*.bam – does the same, but only if the first character of the file is a capital letter
- ls [ABcd]*.bam – lists all .bam files whose 1st letter is A, B, c or d.
- ls *.{fastq,fq}.gz – lists all .fastq.gz and .fq.gz files.
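These patterns are easy to try out on a few empty files in a scratch directory (the filenames below are invented for the demo):

```shell
# create a scratch directory with some dummy files
mkdir -p glob_demo && cd glob_demo
touch A.bam b.bam reads.fastq.gz reads.fq.gz

ls *.bam            # matches A.bam and b.bam
ls [A-Z]*.bam       # matches only A.bam
ls *.{fastq,fq}.gz  # matches reads.fastq.gz and reads.fq.gz
```

Note that `{fastq,fq}` is bash brace expansion rather than globbing: the shell expands it into the two patterns `*.fastq.gz` and `*.fq.gz` before any matching is done.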
...
- samtools view converts the binary small.bam file to text and writes alignment record lines one at a time to standard output.
- -F 0x4 option says to filter out any records where the 0x4 flag bit is 1 (set)
- since the 0x4 flag bit is set (1) for unmapped records, this says to only report records where the query sequence did map to the reference
- | head -1000
- the pipe connects the standard output of samtools view to the standard input of head
- the -1000 option says to only write the first 1000 lines of input to standard output
- | cut -f 5
- the pipe connects the standard output of head to the standard input of cut
- the -f 5 option says to only write the 5th field of each input line to standard output (input fields are tab-delimited by default)
- the 5th field of an alignment record is an integer representing the alignment mapping quality
- the resulting output will have one integer per line (and 1000 lines)
- | sort -n
- the pipe connects the standard output of cut to the standard input of sort
- the -n option says to sort input lines according to numeric sort order
- the resulting output will be 1000 numeric values, one per line, sorted from lowest to highest
- | uniq -c
- the pipe connects the standard output of sort to the standard input of uniq
- the -c option says to count each group of adjacent lines with the same value (that's why they must be sorted) and report the total for each group
- the resulting output will be one line for each group that uniq sees
- each line will have the text for the group (here the unique mapping quality values) and a count of lines in each group
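Assembled into a single command, the pipeline described above is (small.bam is the example file named earlier, so samtools must be installed and the file present):

```shell
# histogram of mapping quality values for the first 1000 mapped reads
samtools view -F 0x4 small.bam | head -1000 | cut -f 5 | sort -n | uniq -c
```

The `cut | sort -n | uniq -c` tail is generic: it builds a numeric histogram of field 5 of any tab-delimited input, not just SAM records.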
...
There are three types of quoting in the shell:
- single quoting (e.g. 'some text') – this serves two purposes
- It groups together all text inside the quotes into a single argument that is passed to the command.
- It tells the shell not to "look inside" the quotes to perform any evaluations.
- Anything that looks like an environment variable or a bash meta-character is not evaluated.
- No pathname globbing (e.g. *) is performed.
- double quoting (e.g. "some text") – also serves two purposes
- It groups together all text inside the quotes into a single argument that is passed to the command.
- It allows environment variable evaluation (but inhibits pathname globbing).
- backtick quoting (e.g. `date`)
- It evaluates the expression inside the backticks.
- The resulting standard output of the expression replaces the backticked text.
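The three quoting styles are easy to compare side by side with echo:

```shell
name="world"
echo 'hello $name'           # single quotes: prints hello $name (no evaluation)
echo "hello $name"           # double quotes: prints hello world (variable evaluated)
echo "2 + 3 = `expr 2 + 3`"  # backticks: replaced by the command's standard output
```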
...
- cut -f <field_number(s)> extracts one or more fields (-f) from each line of its input
- -d <delim> to change the field delimiter (Tab by default)
- sort sorts its input using an efficient algorithm
- by default sorts each line lexically
- one or more fields to sort can be specified with one or more -k <start_field_number>,<end_field_number> options
- has options to sort numerically (-n), or numbers-inside-text (version sort -V)
- -t <delim> to change the field delimiter (whitespace -- one or more spaces or Tabs – by default)
- uniq -c counts groupings of its input (which must be sorted) and reports the text and count for each group
- use cut | sort | uniq -c for a quick-and-dirty histogram (see piping a histogram)
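For example, a quick histogram of the 2nd field of some toy tab-delimited input (generated here with printf):

```shell
printf 'r1\tchr1\nr2\tchr2\nr3\tchr1\n' | cut -f 2 | sort | uniq -c
# outputs two groups: a count of 2 for chr1 and 1 for chr2
```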
grep -P '<pattern>' searches for <pattern> in its input and outputs only lines containing it
- always enclose <pattern> in single quotes to inhibit shell evaluation!
- -P says use Perl patterns, which are much more powerful than standard grep patterns
- -c says just return a count of line matches
- -n says include the line number of the matching line
- -v (inverse match) says return only lines not matching the pattern
- -L says return only the names of files containing no pattern matches
- -l says return only the names of files that do contain a pattern match
- <pattern> can contain special match meta-characters and modifiers such as:
- ^ – matches beginning of line
- $ – matches end of line
- . – (period) matches any single character
- * – modifier; place after an expression to match 0 or more occurrences
- + – modifier, place after an expression to match 1 or more occurrences
- \s – matches any whitespace (\S any non-whitespace)
- \d – matches digits 0-9
- \w – matches any word character: A-Z, a-z, 0-9 and _ (underscore)
- \t matches Tab
- \r matches Carriage return
- \n matches Linefeed
- [xyz123] – matches any single character (including special characters) among those listed between the brackets [ ]
- this is called a character class.
- use [^xyz123] to match any single character not listed in the class
- (Xyz|Abc) – matches either Xyz or Abc or any text or expressions inside parentheses separated by | characters
- note that parentheses ( ) may also be used to capture matched sub-expressions for later use
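A few of these meta-characters in action (the regions.txt file here is invented for the demo; -P requires GNU grep):

```shell
printf 'chr1\t100\nchrM\t25\nscaffold_2\t7\n' > regions.txt

grep -P '^chr\d' regions.txt     # lines beginning with chr + a digit (chr1 only)
grep -c -P '\t\d+$' regions.txt  # count of lines ending in a tab + digits (all 3)
grep -v -P '^chr' regions.txt    # lines NOT beginning with chr (scaffold_2 only)
```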
- Regular expression modules are available in nearly every programming language (Perl, Python, Java, PHP, awk, even R)
- each "flavor" is slightly different
- even the command line has multiple grep variants: grep, egrep (extended patterns) and fgrep (fixed strings, no patterns at all)
- There are many good online regular expression tutorials, but be sure to pick one tailored to the language you will use.
- here's a good general one: https://www.regular-expressions.info/
- and a perl regex tutorial: http://perldoc.perl.org/perlretut.html
- perl regular expressions are the "gold standard" used in most other languages
awk '<script>' – a powerful scripting language that is easily invoked from the command line
- <script> is applied to each line of input (generally piped in)
- always enclose <script> in single quotes to inhibit shell evaluation
- General structure of an awk script:
- BEGIN{<expressions>} – use to initialize variables before any script body lines are executed
- e.g. BEGIN{FS=":"; OFS="\t"; sum=0} says
- use colon (:) as the input field separator (FS), and Tab (\t) as the output field separator (OFS)
- the default input field separator (FS) is whitespace
- one or more spaces or tabs
- the default output field separator (OFS) is a single space
- initialize the variable sum to 0
- {<body expressions>} – expressions to apply to each line of input
- use $1, $2, etc. to pick out specific input fields
- e.g. {print $3,$4} outputs fields 3 and 4 of the input, separated by the output field separator.
- END{<expressions>} – executed after all input is complete (e.g. print a sum)
- Here is an excellent awk tutorial, very detailed and in-depth
- take a look once you feel comfortable with the example scripts we've gone over in class.
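The BEGIN / body / END structure can be seen in a tiny self-contained script (the colon-delimited input here is invented):

```shell
# sum the 3rd colon-delimited field over all input lines
printf 'alice:x:10\nbob:x:32\n' \
  | awk 'BEGIN{FS=":"; sum=0} {sum += $3} END{print sum}'   # prints 42
```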
...
- samtools view converts each alignment record in yeast_pairedend.sort.bam to text
- the -F 0x4 filter says to output records only for mapped sequences (ones assigned a contig and position)
- BAM files often contain records for both mapped and unmapped reads
- -F filters out records where the specified bit(s) are set (i.e., they are 1)
- so technically we're asking for "not unmapped" reads since bit 0x4 = 1 means unmapped
- the -f 0x2 filter says to output only reads that are flagged as properly paired by the aligner
- these are reads where both R1 and R2 reads mapped within a "reasonable" genomic distance
- -f filters out records where the specified bit(s) are not set (i.e., they are 0)
- alignment records that pass both filters are written to standard output
- | awk
- the pipe | connects the standard output of samtools view to the standard input of awk
- the single quote denotes the start of the awk script
- we don't have to use line continuation characters ( \ followed by a linefeed) within the script because newline characters within the quotes are part of the script
- 'BEGIN{ ... }{...}END{...}'
- these 3 lines of text, enclosed in single quotes, are the awk script
- the BEGIN{ FS="\t"; sum=0; nrec=0; } block is executed once before the script processes any input data
- it says to use Tab("\t") as the input (FS) field separator (default is whitespace), and initialize the variables sum and nrec to 0.
- { if ($9 > 0) {sum += $9; nrec++} }
- this is the body of the awk script, which is executed for each line of input
- $9 represents the 9th tab-delimited field of the input
- the 9th field of an alignment record is the insert size, according to the SAM format spec
- we only execute the main part of the body when the 9th field is positive: if ($9 > 0)
- since each proper pair will have one alignment record with a positive insert size and one with a negative insert size, this check keeps us from double-counting insert sizes for pairs
- when the 9th field is positive, we add its value to sum (sum += $9) and add one to our record count (nrec++)
- END{ print sum/nrec; }
- the END block between the curly brackets { } is executed once after the script has processed all input data
- this prints the average insert size (sum/nrec) to standard output
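Assembled, the whole command this section walks through is (file name as given above; requires samtools and the BAM file):

```shell
# average insert size of mapped, properly-paired reads
samtools view -F 0x4 -f 0x2 yeast_pairedend.sort.bam | awk '
BEGIN{ FS="\t"; sum=0; nrec=0; }
{ if ($9 > 0) {sum += $9; nrec++} }
END{ print sum/nrec; }'
```

The awk script itself runs unchanged on any tab-delimited input with a numeric 9th field.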
process multiple files with a for loop
...