...
- The -P option tells grep to Perl-style regular expression patterns.
- This makes including special characters like Tab ( \t ), Carriage Return ( \r ) or Linefeed ( \n ) much easier that the default POSIX paterns.
- While it is not required here, it generally doesn't hurt to include this option.
'^>^>' is the regular expression describing the pattern we're looking for (described below)
- sacCer3.fa is the file to search.
- lines with text that match our pattern will be written to standard output
- non matching lines will be omitted
- We pipe to more just in case there are a lot of contig names.
Now down to the nuts and bolts of the pattern: '^>^>'
First, the single quotes around the pattern – this tells the bash shell to pass the exact string contents to grep.
As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>^>, to the grep program. (Note that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate. Read more about Quoting in the shell.)
So what does ^>^> mean to grep? We know that contig name lines always start with a > character, so > is a literal for grep to use in its pattern match.
We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ ) metacharacter represents "beginning of line". So ^>^> means "beginning of a line followed by a > character".
...
alignment type | aligner options | pro's | con's |
---|---|---|---|
global with bwa | SEsingle end reads:
PEpaired end reads:
|
|
|
global with bowtie2 | bowtie2 --global |
|
|
local with bwa | bwa mem |
|
|
local with bowtie2 | bowtie2 --local |
|
|
...
Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). Note that bwa is available both from the standard TACC module system and as as a BioContainers. module.
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
|
Code Block | ||
---|---|---|
| ||
module load biocontainers # takes a while module load bwa bwa |
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took ~123 ~78 seconds (~2 ~1.3 minutes). |
Since you have your own private compute node, you can use all its resources. It has 68 128 cores, so re-run the R2 alignment asking for 60 execution threads.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
The last few lines of bwa's execution output should look something like this:
So the R2 alignment took only ~20 ~5 seconds (real time), or 615+ times as fast as with only one processing thread. Note, though, that the CPU time with 60 threads was greater (617 142.8 sec) than with only 1 thread (124 77.6 sec). That's because of the thread management overhead when using multiple threads. |
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
|
Code Block | ||
---|---|---|
| ||
# If not already loaded module load biocontainers # takes a while module load samtools samtools |
...
Code Block | ||
---|---|---|
| ||
Program: samtools (Tools for alignments in the SAM format) Version: 1.109 (using htslib 1.109) Usage: samtools <command> [options] Commands: -- Indexing dict create a sequence dictionary file faidx index/extract FASTA fqidx index/extract FASTQ index index alignment -- Editing calmd recalculate MD/NM tags and '=' bases fixmate fix mate information reheader replace BAM header targetcut cut fosmid regions (for fosmid pool only) addreplacerg adds or replaces RG tags markdup mark duplicates -- File operations collate shuffle and group alignments by name cat concatenate BAMs merge merge sorted alignments mpileup multi-way pileup sort sort alignment file split splits a file by read group quickcheck quickly check if SAM/BAM/CRAM file appears intact fastq converts a BAM to a FASTQ fasta converts a BAM to a FASTA -- Statistics bedcov read depth per BED region coverage alignment depth and percent coverage depth compute the depth flagstat simple stats idxstats BAM index stats phase phase heterozygotes stats generate stats (former bamcheck) -- Viewing flags explain BAM flags tview text alignment viewer view SAM<->BAM<->CRAM conversion depad convert padded BAM to unpadded BAM |
...
The samtools view utility provides a way of converting between SAM (text) and BAM (binary, compressed) format. It also provides many, many other functions which we will discuss lster. To get a preview, execute samtools view without any other arguments. You should see:
...
Looking at some of the alignment record information (e.g. samtools view yeast_pairedend.bam | cut -f 1-4 | more), you will notice that read names appear in adjacent pairs (for the R1 and R2), in the same order they appeared in the original FASTQ file. Since that means the corresponding mappings are in no particular order, searching through the file very inefficient. samtools sort re-orders entries in the SAM file either by locus (contig name + coordinate position) or by read name.
...
Code Block | ||
---|---|---|
| ||
Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -n Sort by read name -t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set) -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam --no-PG do not add a PG line --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT Number of additional threads to use [0] --verbosity INT Set level of verbosity |
In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.
...
Expand | ||
---|---|---|
| ||
The yeast_pairedend.sam text file is the largest at ~348 MB. The name-ordered binary yeast_pairedend.bam text file only about 1/3 that size, ~110 ~111 MB. They contain exactly the same records, in the same order, but conversion from text to binary results in a much smaller file. The coordinate-ordered binary yeast_pairedend.sort.bam file is even slightly smaller, ~91 ~92 MB. This is because BAM files are actually customized gzip-format files. The customization allows blocks of data (e.g. all alignment records for a contig) to be represented in an even more compact form. You can read more about this in section 4 of the SAM format specification. |
...
Expand | ||
---|---|---|
| ||
While the yeast_pairedend.sort.bam text file is ~91 ~92 MB, its index (yeast_pairedend.sort.bai) is only 20 KB. |
...
Ignore the "+ 0" addition to each line - that is a carry-over convention for counting "QA-failed reads" that is no longer relevant.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
Divide the number of properly paired reads by the number of mapped reads:
|
Expand | ||
---|---|---|
| ||
About 86% of mapped read were properly paired. This is actually a bit on the low side for ChIP-seq alignments which typically over 90%. |
...
Code Block | ||||
---|---|---|---|---|
| ||||
samtools idxstats yeast_pairedend.sort.bam | tee yeast_pairedend.idxstats.txt |
Here we use the tee command which reports its standard input to standard output before also writing it to the specified file.
Code Block | ||||
---|---|---|---|---|
| ||||
chrI 230218 8820 1640 chrII 813184 36616 4026 chrIII 316620 13973 1530 chrIV 1531933 72675 8039 chrV 576874 27466 2806 chrVI 270161 10866 1222 chrVII 1090940 50893 5786 chrVIII 562643 24672 3273 chrIX 439888 16246 1739 chrX 745751 31748 3611 chrXI 666816 28017 2776 chrXII 1078177 54783 10124 chrXIII 924431 40921 4556 chrXIV 784333 33070 3703 chrXV 1091291 48714 5150 chrXVI 948066 44916 5032 chrM 85779 3268 291 * 0 0 571392 |
...