...
We will also use two additional RNA-seq datasets, which are located at:
Code Block |
---|
$CLASSDIR/corral-repl/utexas/BioITeam/core_ngs_tools/human_stuff |
File Name | Description | Sample |
---|
Sample_Yeast_L005_R1.cat.fastq.gz | Paired-end Illumina, First of pair, FASTQ | Yeast ChIP-seq |
Sample_Yeast_L005_R2.cat.fastq.gz | Paired-end Illumina, Second of pair, FASTQ | Yeast ChIP-seq |
human_rnaseq.fastq.gz | Paired-end Illumina, First of pair only, FASTQ | Human RNA-seq |
human_mirnaseq.fastq.gz | Single-end Illumina, FASTQ | Human microRNA-seq |
...
Exercise: How many contigs are there in the sacCer3 reference?
Expand |
---|
|
Code Block |
---|
| grep -P '^>' sacCer3.fa | wc -l |
or Or use grep's -c option that says "just count the line matches" Code Block |
---|
| grep -P -c '^>' sacCer3.fa |
|
Expand |
---|
|
There are 17 contigs. |
Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM file from the input FASTQ files and reference. First go to the align directory, and link to the sacCer3 reference directory (this will make our commands more readable).
...
What does bwa aln needs in the way of arguments?
There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.
...
You did it! You should now have a SAM file that contains the alignments. ItIt's just a text file, so take a look with head, more, less, tail, or whatever you feel like. In In the next section, with samtools, you'll learn some additional ways to analyze the data once you create a BAM file.
...
Expand |
---|
|
This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record). Remember -c says just count the records, don't display them. Code Block |
---|
| grep -P -c '^@' yeast_pairedend.sam |
Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @. Code Block |
---|
| grep -P -v -c '^HWI' yeast_pairedend.sam |
|
Expand |
---|
|
There are 1184360 1,184,360 alignment records. |
Exercise: How many sequences were in the R1 and R2 FASTQ files combined?
...
Expand |
---|
|
There were a total of 1184360 1,184,360 original sequences |
Exercises:
...
Code Block |
---|
title | bowtie2 index files for miRNAs |
---|
|
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2 |
Performing the bowtie2 local alignment
Now, we're ready to actually try to do the alignment. Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:
...
- This is one alignment record, although it has been broken up below for readability.
- This read mapped to the mature microRNA sequence hsa-mir-302b, starting at base 50 in that contig.
- Notice the CIGAR string is 3S20M13S, meaning that 3 bases were soft clipped from one end (3S), and 13 from the other (13S).
- If we did the same alignment using either bowtie2 --end-to-end mode, or using bwa aln as in Exercise #1, very little of this file would have aligned.
- The 20M part of the CIGAR string says there was a block of 20 read bases that mapped to the reference.
- If we had not lowered the seed parameter of bowtie2 from its default of 22, we would not have found many of the alignments like this one that only matched for 20 bases.
...
The test file we will be working with is just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end. Go ahead and take a look at it, and find out how many reads are in the file.
Expand |
---|
|
Code Block |
---|
| cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4)) |
|
RNA-seq alignment with bwa aln
...
Code Block |
---|
bwa mem <ref.fa> <reads.fq> > outfile.sam |
Expand |
---|
|
Code Block |
---|
| bwa mem hg19/hg19.fa fq/human_rnaseq.fastq.gz > human_rnaseq_mem.sam |
|
Check the length of the SAM file you generated with wc -l. Since there is one alignment per line, there must be 586266 alignments (minus no more than 100 header lines), which is more than the number of sequences in the FASTQ file. This is bwa mem can report multiple alignment records for the same read, hopefully on either side of a splice junction. These alignments can still be tied together because they have the same read ID.
...