Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In RNA-Seq, the abundance level of a gene is measured by the number of reads that map to that gene.  Once the reads have been mapped to our reference, we must now count the number of reads that map to RNA units of interest to obtain gene/exon/transcript counts. Here, we shall look at different methods for doing this.

 

In our RNA-Seq analysis paths, here we will be exploring this path:

Image Removed

Count reads mapping to genes

Get set up

Code Block
titleGet set up for the exercises
cds
cd my_rnaseq_course
cd day_1_partB/hisat3/gene_expression_exercise/results/

Bedtools

Bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

...

Code Block
titleTo find the names of chromosomes in genome file
grep '^>' ../reference/genome.fa
Code Block
titleTo find the names of chromosomes in gtf file
cut -f 1 ../reference/genes.gtf |sort|uniq -c

...

Code Block
titleSed command
sed 's/^dmel_mitochondrion_genome/M/' reference/genes.gtf > reference/genes.formatted.gtf
cut -f 1 ../reference/genes.formatted.gtf |sort|uniq -c

...

Warning
titleWarning: To submit to queue
Code Block
nano commands.bedtools

bedtools multicov -bams hisat_results/C1_R1_thout/accepted_hits.sorted.bam hisat_results/C1_R2_thout/accepted_hits.sorted.bam hisat_results/C1_R3_thout/accepted_hits.sorted.bam hisat_results/C2_R1_thout/accepted_hits.sorted.bam hisat_results/C2_R2_thout/accepted_hits.sorted.bam hisat_results/C2_R3_thout/accepted_hits.sorted.bam -bed reference/genes.formatted.gtf > gene_counts.gff 

 
#use ctrl+x to quit out of nano
Code Block
launcher_creator.py -j commands.bedtools -n multicov -q normal -t 02:00:00 -a UT-2015-05-18 -l bedtools_launcher.slurm
sbatch bedtools_launcher.slurm    #use the reservation: CCBB_5.24.17PM 

HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads.  HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. However, if you are looking for more fine grained control over how to count genes, especially when a read overlaps more than one gene/feature, htseq-count would be an option.

...

Code Block
titleLet's just take a look at the commands
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff 

 
join C1_count1.gff C1_count2.gff| join - C1_count3.gff | join - C2_count4.gff |join - C2_count5.gff|join - C2_count6.gff > gene_counts_HTseq.gff
#if you have many samples, use for-loop and join

...

Warning
titleWarning: To submit to queue
Code Block
nano commands.htseq
 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff 
Code Block
launcher_creator.py -j commands.htseq -n htseq -q normal -t 02:00:00 -a UT-2015-05-18 -l htseq_launcher.slurm
qsubsbatch htseq_launcher.slurm

AFTER THIS COMPLETES:

Code Block
join C1_count1.gff C1_count2.gff| join - C1_count3.gff | join - C2_count4.gff |join - C2_count5.gff|join - C2_count6.gff > gene_counts_HTseq.gff

#if you have many samples, use for-loop and join

...

Then take a peek at the results...

 

...

cds cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2016/day_2/ my_rnaseq_course/ cd my_rnaseq_course/day_2/gene_expression_exercise
Code Block
Code Block
titleBEDTOOLS output
head gene_counts.gff

wc -l gene_counts.gff

...