...
Code Block |
---|
language | bash |
---|
title | Look at GFF annotation entries with less |
---|
|
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff
# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20
# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head |
...
One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)
Expand |
---|
|
Code Block |
---|
| mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff . |
|
Code Block |
---|
language | bash |
---|
title | Create a histogram of all the feature types in a GFF |
---|
|
cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
sort | uniq -c | sort -k1,1nr | more |
...
Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).
Expand |
---|
|
Code Block |
---|
| mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff . |
|
Code Block |
---|
language | bash |
---|
title | Convert GFF to BED with BioITeam script |
---|
|
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
> sc_genes.converted.bed |
...
The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.
Code Block |
---|
language | bash |
---|
title | Re-order the final BED fields |
---|
|
head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
' > sc_genes.bed.hdr
tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
BEGIN{FS=OFS="\t"}
{ if($15 == "") {$15 = $10} # make sure gene name is populated
print $1,$2,$3,$10,$5,$6,$15,$16}
' > sc_genes.bed |
...
Code Block |
---|
title | Re-ordered BED attributes |
---|
|
1. chrom 2. start 1. chrom 2. start 3. end 4. ID 5. length 6. strand
7. gene 8. orf_classification |
**Whew**! That was a lot of work. Welcome to the world of annotation wrangling. Itwrangling – it's never pretty! But at least the result is much nicer looking. Examine the results using more or less or head:
...
Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.
Expand |
---|
|
Code Block |
---|
| mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes* . |
|
Exercise: How many genes in our sc_genes.bed file are in each category?
...
Code Block |
---|
language | bash |
---|
title | Start an idev session |
---|
|
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBBintro_NGS |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
Code Block |
---|
language | bash |
---|
title | Setup for BEDTools exercises |
---|
|
# To catch up...
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeastsc_mrnagenes.sort.filt.bam* .bed* .
cp $CORENGS/yeast_rna/*.gff .
# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sc_genes.bedyeast_mrna.sort.filt.bam* . |
Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
...
Expand |
---|
|
Code Block |
---|
| cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt |
Code Block |
---|
title | samtools flagstat output |
---|
| 3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5) |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities: Code Block |
---|
| samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n |
Code Block |
---|
title | distribution of mapping qualities |
---|
| 453 20
6260 21
889 22
326 23
971 24
2698 25
376 26
12769 27
268 28
337 29
933 30
1229 31
345 32
5977 33
256 34
249 35
1103 36
887 37
292 38
4648 39
5706 40
426 41
1946 42
1547 43
1761 44
6138 45
1751 46
3019 47
3710 48
3236 49
4467 50
15691 51
25370 52
16636 53
18081 54
7084 55
2701 56
59851 57
2836 58
2118 59
3097901 60 |
To compute average mapping quality: Code Block |
---|
| samtools view yeast_mrna.sort.filt.bam | awk '
BEGIN{FS="\t"; sum=0; tot=0}
{sum = sum + $5; tot = tot + 1}
END{print "average:",sum/tot,"for",tot,"reads"}' |
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
...
Code Block |
---|
language | bash |
---|
title | Run bedtools multicov to count BAM alignments overlapping a set of genes |
---|
|
module load bedtools
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
-bed sc_genes.bed > yeast_mrna_gene_counts.bed |
...
Expand |
---|
|
Code Block |
---|
| cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
'BEGIN{ |
Most of the genes (6235/6607) have non-zero read overlap counts. |
...