...
Like other tools you've worked with so far, you first need to load bwa. Do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments). bwa is available as a BioContainers. module.
...
Make sure you're in
...
an idev session
Code Block | ||||
---|---|---|---|---|
| ||||
idev -m 120180 -N 1 -A OTH21164 -rCoreNGSday4 CoreNGS-Thu
# or
idev -m 90 -N 1 -A OTH21164 -p development |
Code Block | ||
---|---|---|
| ||
module load biocontainers # takes a while module load bwa bwa |
...
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/references/bwa/sacCer3 cd $SCRATCH/core_ngs/references/bwa/sacCer3 ln -ssf ../../fasta/sacCer3.fa ls -l |
...
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa cd $SCRATCH/core_ngs/alignment/yeast_bwa ln -ssf -f ../fastq ln -ssf -f ../../references/bwa/sacCer3 |
...
Expand | |||||
---|---|---|---|---|---|
| |||||
|
Required arguments are a <prefix> of the bwa index files, and the input FASTQ file. There are lots of options, but here is a summary of the most important ones.
...
You should now have a SAM file (yeast_pe.sam) that contains the alignments. It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.
Exercise: How many lines does the SAM file have? How does this compare to the number of input sequences (R1+R2)?
Expand | ||
---|---|---|
| ||
wc -l yeast_pe.sam reports 1,184,378 lines The alignment SAM file will contain records for both R1 and R2 reads, so we need to count sequences in both files. zcat ./fastq/Sample_Yeast_L005_R[12]*gz | wc -l | awk '{print $1/4}' reports 1,184,360 reads that were aligned So the SAM file has 18 more lines than the R1+R2 total. These are the header records that appear before any alignment records. |
It's just a text file, so take a look with head, more, less, tail, or whatever you feel like. Later you'll learn additional ways to analyze the data with samtools once you create a BAM file.
Exercise: Exercise: What kind of information is in the first lines of the SAM file?
...
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern; here, all header lines, which start with @.
|
Expand | ||
---|---|---|
| ||
There are 1,184,360 alignment records. |
Exercise: How many sequences were in the R1 and R2 FASTQ files combined?
.
| |||||
Expand | |||||
---|---|---|---|---|---|
| |||||
zcat fastq/Sample_Yeast_L005_R[12].cat.fastq.gz | wc -l | awk '{print $1/4}' |
Expand | ||
---|---|---|
| ||
There were a total of are 1,184,360 original sequences (R1s + R2s)alignment records. |
Exercises:
...
- Does the SAM file contain both mapped and un-mapped reads?
- What is the order of the alignment records in this SAM file?
Expand | ||
---|---|---|
| ||
Both R1 and R2 reads must have separate alignment records, because there were 1,184,360 R1+R2 reads and the same number of alignment records. The SAM file must contain both mapped and unmapped un-mapped reads, because there were 1,184,360 R1+R2 reads and the same number of alignment records. Alignment records occur in the same read-name order as they did in the FASTQ, except that they come in pairs. The R1 read comes 1st, then the corresponding R2. This is called read name ordering. |
...