In RNA-Seq, the abundance level of a gene is measured by the number of reads that map to that gene. Once the reads have been mapped to our reference, we must now count the number of reads that map to RNA units of interest to obtain gene/exon/transcript counts. Here, we shall look at different methods for doing this.

In our RNA-Seq analysis paths, here we will be exploring this path:

Image Removed

Count reads mapping to genes

Get set up

Code Block

title	Get set up for the exercises

cds
cd my_rnaseq_course
cd day_1_partB/hisat3/gene_expression_exercise/results/

Bedtools

Bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

...

Code Block

title	To find the names of chromosomes in genome file

grep '^>' ../reference/genome.fa

Code Block

title	To find the names of chromosomes in gtf file

cut -f 1 ../reference/genes.gtf |sort|uniq -c

...

Code Block

title	Sed command

sed 's/^dmel_mitochondrion_genome/M/' reference/genes.gtf > reference/genes.formatted.gtf
cut -f 1 ../reference/genes.formatted.gtf |sort|uniq -c

...

Warning

title	Warning: To submit to queue

Code Block

nano commands.bedtools

bedtools multicov -bams hisat_results/C1_R1_thout/accepted_hits.sorted.bam hisat_results/C1_R2_thout/accepted_hits.sorted.bam hisat_results/C1_R3_thout/accepted_hits.sorted.bam hisat_results/C2_R1_thout/accepted_hits.sorted.bam hisat_results/C2_R2_thout/accepted_hits.sorted.bam hisat_results/C2_R3_thout/accepted_hits.sorted.bam -bed reference/genes.formatted.gtf > gene_counts.gff 

 
#use ctrl+x to quit out of nano

Code Block
launcher_creator.py -j commands.bedtools -n multicov -q normal -t 02:00:00 -a UT-2015-05-18 -l bedtools_launcher.slurm sbatch bedtools_launcher.slurm #use the reservation: CCBB_5.24.17PM

HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. However, if you are looking for more fine grained control over how to count genes, especially when a read overlaps more than one gene/feature, htseq-count would be an option.

...

Code Block

title	Let's just take a look at the commands

$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty --stranded reverse -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff 

 
join C1_count1.gff C1_count2.gff| join - C1_count3.gff | join - C2_count4.gff |join - C2_count5.gff|join - C2_count6.gff > gene_counts_HTseq.gff
#if you have many samples, use for-loop and join

...

Warning

title	Warning: To submit to queue

Code Block

nano commands.htseq
 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count1.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count2.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C1_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C1_count3.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R1_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count4.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R2_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count5.gff 
$TACC_HTSEQ_DIR/scripts/htseq-count -m intersection-nonempty -i gene_id C2_R3_thout/accepted_hits.sam reference/genes.formatted.gtf > C2_count6.gff

Code Block
launcher_creator.py -j commands.htseq -n htseq -q normal -t 02:00:00 -a UT-2015-05-18 -l htseq_launcher.slurm qsubsbatch htseq_launcher.slurm

AFTER THIS COMPLETES:

Code Block
join C1_count1.gff C1_count2.gff\| join - C1_count3.gff \| join - C2_count4.gff \|join - C2_count5.gff\|join - C2_count6.gff > gene_counts_HTseq.gff #if you have many samples, use for-loop and join

...

Then take a peek at the results...

...

cds cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2016/day_2/ my_rnaseq_course/ cd my_rnaseq_course/day_2/gene_expression_exercise

Code Block

title	BEDTOOLS output

head gene_counts.gff

wc -l gene_counts.gff

...

Space shortcuts

Page tree

Versions Compared

Old Version 12

New Version 13

Key

Count reads mapping to genes

Get set up

Bedtools

HTseq

Space shortcuts

Page tree

Page History

Versions Compared

Old Version 12

New Version 13

Key

Count reads mapping to genes

Get set up

Bedtools

HTseq