Page History

...

We will also use two additional RNA-seq datasets, which are located at:

Code Block
$CLASSDIR/corral-repl/utexas/BioITeam/core_ngs_tools/human_stuff

File Name	Description	Sample
Sample_Yeast_L005_R1.cat.fastq.gz	Paired-end Illumina, First of pair, FASTQ	Yeast ChIP-seq
Sample_Yeast_L005_R2.cat.fastq.gz	Paired-end Illumina, Second of pair, FASTQ	Yeast ChIP-seq
human_rnaseq.fastq.gz	Paired-end Illumina, First of pair only, FASTQ	Human RNA-seq
human_mirnaseq.fastq.gz	Single-end Illumina, FASTQ	Human microRNA-seq

...

Exercise: How many contigs are there in the sacCer3 reference?

Expand

title	Hint

Code Block

language	bash

grep -P '^>' sacCer3.fa | wc -l

or Or use grep's -c option that says "just count the line matches"

Code Block

language	bash

grep -P -c '^>' sacCer3.fa

Expand

title	Answer

There are 17 contigs.

Performing the bwa alignment

Now, we're ready to execute the actual alignment, with the goal of initially producing a SAM file from the input FASTQ files and reference. First go to the align directory, and link to the sacCer3 reference directory (this will make our commands more readable).

...

What does bwa aln needs in the way of arguments?

Expand

title	Hint

Code Block

language	bash

bwa aln

There are lots of options, but here is a summary of the most important ones. BWA, is a lot more complex than the options let on. If you look at the BWA manual on the web for the aln sub-command, you'll see numerous options that can increase the alignment rate (as well as decrease it), and all sorts of other things.

...

You did it! You should now have a SAM file that contains the alignments. ItIt's just a text file, so take a look with head, more, less, tail, or whatever you feel like. In In the next section, with samtools, you'll learn some additional ways to analyze the data once you create a BAM file.

...

Expand

title	Hint

This looks for the pattern '^HWI' which is the start of every read name (which starts every alignment record).
Remember -c says just count the records, don't display them.

Code Block

language	bash

grep -P -c '^@' yeast_pairedend.sam

Or use the -v (invert) option to tell grep to print all lines that don't match a particular pattern, here the header lines starting with @.

Code Block

language	bash

grep -P -v -c '^HWI' yeast_pairedend.sam

Expand

title	Answer

There are 1184360 1,184,360 alignment records.

Exercise: How many sequences were in the R1 and R2 FASTQ files combined?

...

Expand

title	Answer

There were a total of 1184360 1,184,360 original sequences

Exercises:

...

Code Block

title	bowtie2 index files for miRNAs

hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2

Performing the bowtie2 local alignment

Now, we're ready to actually try to do the alignment. Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:

...

This is one alignment record, although it has been broken up below for readability.
This read mapped to the mature microRNA sequence hsa-mir-302b, starting at base 50 in that contig.
Notice the CIGAR string is 3S20M13S, meaning that 3 bases were soft clipped from one end (3S), and 13 from the other (13S).
- If we did the same alignment using either bowtie2 --end-to-end mode, or using bwa aln as in Exercise #1, very little of this file would have aligned.
The 20M part of the CIGAR string says there was a block of 20 read bases that mapped to the reference.
- If we had not lowered the seed parameter of bowtie2 from its default of 22, we would not have found many of the alignments like this one that only matched for 20 bases.

...

The test file we will be working with is just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end. Go ahead and take a look at it, and find out how many reads are in the file.

Expand

title	Hint:

Code Block

language	bash

cds
less fastq_align/human_rnaaseq.fastq.gz
gunzip -c fastq_align/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

RNA-seq alignment with bwa aln

...

Code Block
bwa mem <ref.fa> <reads.fq> > outfile.sam

Expand

title	Hint:

Code Block

language	bash

bwa mem hg19/hg19.fa fq/human_rnaseq.fastq.gz > human_rnaseq_mem.sam

Check the length of the SAM file you generated with wc -l. Since there is one alignment per line, there must be 586266 alignments (minus no more than 100 header lines), which is more than the number of sequences in the FASTQ file. This is bwa mem can report multiple alignment records for the same read, hopefully on either side of a splice junction. These alignments can still be tied together because they have the same read ID.

...

Page tree

Versions Compared

Old Version 53

New Version 54

Key

Performing the bwa alignment

Performing the bowtie2 local alignment

RNA-seq alignment with bwa aln