...
Step 1: Evaluate and manipulate raw data
Code Block | ||
---|---|---|
| ||
cd quality_exercise head data/Sample1_R1.fastq wc -l data/Sample1_R1.fastq grep -c '^@HWI' data/Sample1_R1.fastq module spider fastqc module load fastqc fastqc -h #DONT RUN ON HEAD NODE: fastqc data/Sample1_R1.fastq |
Look at Fastqc results:
Ideal dataset: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
Our dataset: http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/fastqc_exercise/Sample1_R1_fastqc/fastqc_report.html
Code Block | ||
---|---|---|
| ||
module spider fastx module load fastx_toolkit #DONT RUN ON HEAD NODE: fastx_trimmer -i data/Sample1_R1.fastq -l 90 -Q 33 -o Sample1_R1.trimmed.fastq #DONT RUN ON HEAD NODE: fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq grep -c '^@HWI' results/Sample1_R1.trimmed.fastq grep -c '^@HWI' results/Sample1_R1.filtered.fastq |
Code Block | ||
---|---|---|
| ||
fastx_clipper -h #DONT RUN ON HEAD NODE: /corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC Sample1_R1.fastq /corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA Sample1_R1.fastq |
Step 2: Map Raw Data to Reference, Assess results
Code Block | ||
---|---|---|
| ||
cd mapping_exercise module spider bwa module load bwa/0.7.7 #DONT RUN ON HEAD NODE: bwa index -a bwtsw reference/genome.fa bwa mem #DONT RUN ON HEAD NODE: bwa mem reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq > C1_R1.mem.sam |
Code Block | ||
---|---|---|
| ||
module spider tophat module load tophat/2.0.10 #DONT RUN ON HEAD NODE tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq |
Code Block | ||
---|---|---|
| ||
module load samtools #SYNTAX: samtools view -b -S samfile > bamfile samtools sort bamfile sortedbamfile samtools index sortedbamfile #BWA RESULTS samtools flagstat bwa_results/C1_R1.mem.bam samtools idxstats bwa_results/C1_R1.mem.bam #SAMTOOLS RESULTS samtools flagstat tophat_results/C1_R1_thout/accepted_hits.bam #How is tophat result different from bwa head -20 tophat_results/C1_R1_thout/accepted_hits.sam cut -f 6 tophat_results/C1_R1_thout/accepted_hits.sam|grep 'N'|head cut -f 6 bwa_results/C1_R1.mem.sam|grep 'N'|head #DONT RUN ON HEAD NODE cut -f 6 tophat_results/C1_R1_thout/accepted_hits.sam|grep 'N'|wc -l |
Code Block | ||
---|---|---|
| ||
cd deg_exercise/ module swap intel gcc/4.7.1 module load bedtools bedtools bedtools multicov #DONT RUN ON HEAD NODE bedtools multicov -bams C1_R1_thout/accepted_hits.bam C1_R2_thout/accepted_hits.bam C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam C2_R2_thout/accepted_hits.bam C2_R3_thout/accepted_hits.bam -bed reference/genes.formatted.gtf > gene_counts.gff #What's the difference between bedtools and htseq results? #First look at the annotation file head ../mapping_exercise/reference/genes.formatted.gtf #bedtools result head gene_counts.gff wc -l gene_counts.gff #htseq result head gene_counts.htseq.gff wc -l gene_counts.htseq.gff #DESEQ2 result: head deseq_results/DESeq-C1-vs-C2.csv |
Step 2: Assemble Transcripts/Tuxedo Suite
CUFFLINKS OUTPUT: TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
Column number | Column name | Example | Description |
1 | seqname | chrX | Chromosome or contig name |
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
3 | feature | exon | The type of record (always either "transcript" or "exon".) |
4 | start | 77696957 | The leftmost coordinate of this record. |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
7 | frame | . | Not used |
8 | attributes | ... | See below. |
- Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
gene_id | CUFF.1 | Cufflinks gene id |
transcript_id | CUFF.1.1 | Cufflinks transcript id |
FPKM | 101.267 | Isoform-level relative abundance in FPKM |
frac | 0.7647 | Reserved. Please ignore. |
cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Code Block | ||
---|---|---|
| ||
cd cufflinks_cuffdiff_exercise ls -l C1_R1_clout head C1_R1_clout/transcripts.gtf |
Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts
Code Block | ||
---|---|---|
| ||
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 54847 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 88644 What do you think grep -v does? |
CUFFMERGE OUTPUT
Code Block | ||
---|---|---|
| ||
head merged_asm/merged.gtf |
CLASS CODES
Priority | Code | Description | |
1 | = | Complete match of intron chain | |
2 | c | Contained | |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
5 | i | A transfrag falling entirely within a reference intron | |
6 | o | Generic exonic overlap with a reference transcript | |
7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
9 | u | Unknown, intergenic transcript | |
10 | x | Exonic overlap with reference on the opposite strand | |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Code Block | ||
---|---|---|
| ||
#count number of transcripts that are potentially novel grep 'class_code "j"' merged_asm/merged.gtf|cut -f 9|cut -d ';' -f 5|sort|uniq|wc -l |
CUFFDIFF OUTPUT
Code Block | ||
---|---|---|
| ||
ls -l diff_out head diff_out/gene_exp.diff |
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
Code Block | ||
---|---|---|
| ||
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head |
Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3 |
Expand | ||||
---|---|---|---|---|
| ||||
Top 10 upregulated genes scf crc KdelR Nacalpha CG8979 CG4389 Vha68-2 regucalcin CG3835 CG1746 |
Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3 |
Expand | ||||
---|---|---|---|---|
| ||||
Top 10 downregulated genes CG17065 Git CG18519 del CG13319 RNaseX25 msb1l achi CG15611 CG11309 |
Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
Code Block | ||
---|---|---|
| ||
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head #To pull out gene names: cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3 |
Expand | ||||
---|---|---|---|---|
| ||||
sick mub CG5021 akirin CG4587 c(3)G Ank2 CG1674 CrebB-17A stai |
Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG wc -l DEG |
...