...
In RNA-Seq, the abundance level of a gene is measured by the number of reads that map to that gene. Once the reads have been mapped to our reference, we must now count the number of reads that map to RNA units of interest to obtain gene/exon/transcript counts. Here, we shall look at different methods for doing this.
In our RNA-Seq analysis paths, here we will be exploring this path:
Image Removed
Count reads mapping to genes
Get set up
Code Block |
---|
title | Get set up for the exercises |
---|
|
cds
cd my_rnaseq_course
cd day_1_partB/hisat3/gene_expression_exercise/results/ |
Bedtools
Bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.
...
Code Block |
---|
title | To find the names of chromosomes in genome file |
---|
|
grep '^>' ../reference/genome.fa |
Code Block |
---|
title | To find the names of chromosomes in gtf file |
---|
|
cut -f 1 ../reference/genes.gtf |sort|uniq -c |
...
Code Block |
---|
|
sed 's/^dmel_mitochondrion_genome/M/' reference/genes.gtf > reference/genes.formatted.gtf
cut -f 1 ../reference/genes.formatted.gtf |sort|uniq -c |
...
Warning |
---|
title | Warning: To submit to queue |
---|
|
Code Block |
---|
nano commands.bedtools
bedtools multicov -bams hisat_results/C1_R1_thout/accepted_hits.sorted.bam hisat_results/C1_R2_thout/accepted_hits.sorted.bam hisat_results/C1_R3_thout/accepted_hits.sorted.bam hisat_results/C2_R1_thout/accepted_hits.sorted.bam hisat_results/C2_R2_thout/accepted_hits.sorted.bam hisat_results/C2_R3_thout/accepted_hits.sorted.bam -bed reference/genes.formatted.gtf > gene_counts.gff
#use ctrl+x to quit out of nano |
Code Block |
---|
launcher_creator.py -j commands.bedtools -n multicov -q normal -t 02:00:00 -a UT-2015-05-18 -l bedtools_launcher.slurm
sbatch bedtools_launcher.slurm #use the reservation: CCBB_5.24.17PM |
|
HTseq
HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. However, if you are looking for more fine grained control over how to count genes, especially when a read overlaps more than one gene/feature, htseq-count would be an option.
...
Code Block |
---|
title | Let's just take a look at the commands |
---|
|
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff
join C1_count1.gff C1_count2.gff| join - C1_count3.gff | join - C2_count4.gff |join - C2_count5.gff|join - C2_count6.gff > gene_counts_HTseq.gff
#if you have many samples, use for-loop and join |
...
Warning |
---|
title | Warning: To submit to queue |
---|
|
Code Block |
---|
nano commands.htseq
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff |
Code Block |
---|
launcher_creator.py -j commands.htseq -n htseq -q normal -t 02:00:00 -a UT-2015-05-18 -l htseq_launcher.slurm
qsubsbatch htseq_launcher.slurm |
AFTER THIS COMPLETES: Code Block |
---|
join C1_count1.gff C1_count2.gff| join - C1_count3.gff | join - C2_count4.gff |join - C2_count5.gff|join - C2_count6.gff > gene_counts_HTseq.gff
#if you have many samples, use for-loop and join |
|
...
Then take a peek at the results...
...
Code Block |
cds
cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2016/day_2/ my_rnaseq_course/
cd my_rnaseq_course/day_2/gene_expression_exercise Code Block |
---|
|
head gene_counts.gff
wc -l gene_counts.gff |
...