Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...


Step 1:  Evaluate and manipulate raw data

Code Block
titleCount and Fastqc
cd quality_exercise
 
head data/Sample1_R1.fastq 

wc -l data/Sample1_R1.fastq 
grep -c '^@HWI'  data/Sample1_R1.fastq 


module spider fastqc
module load fastqc
fastqc -h
#DONT RUN ON HEAD NODE:
fastqc data/Sample1_R1.fastq

 

Look at Fastqc results:

Ideal dataset: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

Our dataset: http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/fastqc_exercise/Sample1_R1_fastqc/fastqc_report.html

 

Code Block
titlemanipulate fastq files- quality trimming and filtering
module spider fastx
module load fastx_toolkit
#DONT RUN ON HEAD NODE:
fastx_trimmer -i data/Sample1_R1.fastq -l 90 -Q 33 -o Sample1_R1.trimmed.fastq
#DONT RUN ON HEAD NODE:
fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq


grep -c '^@HWI' results/Sample1_R1.trimmed.fastq 
grep -c '^@HWI' results/Sample1_R1.filtered.fastq 
Code Block
titlemanipulate fastq files- adaptor trimming
fastx_clipper -h
#DONT RUN ON HEAD NODE:
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC Sample1_R1.fastq
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA Sample1_R1.fastq 

Step 2: Map Raw Data to Reference, Assess results

Code Block
titleMap with BWA
cd mapping_exercise
module spider bwa
module load bwa/0.7.7


#DONT RUN ON HEAD NODE:
bwa index -a bwtsw reference/genome.fa

bwa mem

#DONT RUN ON HEAD NODE:
bwa mem reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq > C1_R1.mem.sam

Code Block
titleMap with Tophat
module spider tophat
module load tophat/2.0.10

#DONT RUN ON HEAD NODE
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq

 

Code Block
titleAssess Mapping Results
module load samtools

#SYNTAX:
samtools view -b -S samfile > bamfile
samtools sort bamfile sortedbamfile
samtools index sortedbamfile


#BWA RESULTS
samtools flagstat bwa_results/C1_R1.mem.bam
samtools idxstats bwa_results/C1_R1.mem.bam


#SAMTOOLS RESULTS
samtools flagstat tophat_results/C1_R1_thout/accepted_hits.bam

#How is tophat result different from bwa
head -20 tophat_results/C1_R1_thout/accepted_hits.sam
cut -f 6 tophat_results/C1_R1_thout/accepted_hits.sam|grep 'N'|head
cut -f 6 bwa_results/C1_R1.mem.sam|grep 'N'|head

#DONT RUN ON HEAD NODE
cut -f 6 tophat_results/C1_R1_thout/accepted_hits.sam|grep 'N'|wc -l

Step 3 and Step 4 :  Quantify transcripts and ID DEGs
Code Block
titleLook at bedtools and DESeq results
cd deg_exercise/
 
module swap intel gcc/4.7.1
module load bedtools
bedtools
bedtools multicov
#DONT RUN ON HEAD NODE
bedtools multicov -bams C1_R1_thout/accepted_hits.bam C1_R2_thout/accepted_hits.bam C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam C2_R2_thout/accepted_hits.bam C2_R3_thout/accepted_hits.bam -bed reference/genes.formatted.gtf > gene_counts.gff

#What's the difference between bedtools and htseq results?
 
#First look at the annotation file
head ../mapping_exercise/reference/genes.formatted.gtf 

#bedtools result 
head gene_counts.gff 
wc -l gene_counts.gff
 
#htseq result
head gene_counts.htseq.gff 
wc -l gene_counts.htseq.gff
 
#DESEQ2 result:
head deseq_results/DESeq-C1-vs-C2.csv 


Step 2: Assemble Transcripts/Tuxedo Suite

CUFFLINKS OUTPUT: TRANSCRIPTS.GTF FILE

Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.

Column numberColumn nameExampleDescription
1seqnamechrXChromosome or contig name
2sourceCufflinksThe name of the program that generated this file (always 'Cufflinks')
3featureexonThe type of record (always either "transcript" or "exon".)
4start77696957The leftmost coordinate of this record.
5end77712009The rightmost coordinate of this record, inclusive.
6score77712009Score that tells you how abundant this isoform is compared to other isoforms of the gene
7strand+Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
7frame.Not used
8attributes...See below.

    Each GTF record is decorated with the following attributes:
AttributeExampleDescription
gene_idCUFF.1Cufflinks gene id
transcript_idCUFF.1.1Cufflinks transcript id
FPKM101.267Isoform-level relative abundance in FPKM
frac0.7647Reserved. Please ignore.
   
cov100.765Estimate for the absolute depth of read coverage across the whole transcript
full_read_supportyes

When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.

Code Block
titleLook at cufflinks results
cd cufflinks_cuffdiff_exercise
ls -l C1_R1_clout
head C1_R1_clout/transcripts.gtf

 

Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts

Code Block
titleParsing transcripts.gtf file
The secret lies in the gene_id column.

#For counting novel entries 
grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

54847
 
#For counting entries corresponding to annotated genes
grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

88644
 
What do you think grep -v does?

 

CUFFMERGE OUTPUT

Code Block
titleLook at CUFFMERGE results
head merged_asm/merged.gtf

CLASS CODES

PriorityCodeDescription
1=Complete match of intron chain
2cContained 
3jPotentially novel isoform (fragment): at least one splice junction is shared with a reference transcript 
4eSingle exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. 
5iA transfrag falling entirely within a reference intron 
6oGeneric exonic overlap with a reference transcript 
7pPossible polymerase run-on fragment (within 2Kbases of a reference transcript) 
8rRepeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case 
9uUnknown, intergenic transcript 
10xExonic overlap with reference on the opposite strand 
11s

An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)

Code Block
titleLook at CUFFMERGE results
#count number of transcripts that are potentially novel
grep 'class_code "j"' merged_asm/merged.gtf|cut -f 9|cut -d ';' -f 5|sort|uniq|wc -l

 

CUFFDIFF OUTPUT

Code Block
titleLook at cuffdiff results
ls -l diff_out
head diff_out/gene_exp.diff

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Code Block
titleLinux one-liner for sorting cuffdiff output by log2 fold-change values
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

 

Find the 10 most up-regulated genes, by fold change that are classified as significantly changed. 

Code Block
titleOne-line command to get 10 most up regulated genes
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3
Expand
Solution
Solution

Top 10 upregulated genes

scf

crc

KdelR

Nacalpha

CG8979

CG4389

Vha68-2

regucalcin

CG3835

CG1746

 

Find the 10 most down-regulated genes, by fold change that are classified as significantly changed. 

Code Block
titleOne-line command to get 10 most down regulated genes
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3
Expand
Solution
Solution

Top 10 downregulated genes

CG17065

Git

CG18519

del

CG13319

RNaseX25

msb1l

achi

CG15611

CG11309

 

Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?

Code Block
titleOne-line command to get 10 most up-regulated isoforms
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3
Expand
Solution
Solution

sick

mub

CG5021

akirin

CG4587

c(3)G

Ank2

CG1674

CrebB-17A

stai

 

Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?

Code Block
titleDEG list
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG

wc -l DEG


 

...