Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleLook at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)

Expand
titleSetup (if needed)
Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
Code Block
languagebash
titleCreate a histogram of all the feature types in a GFF
cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more

...

Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).

Expand
titleSetup (if needed)
Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
Code Block
languagebash
titleConvert GFF to BED with BioITeam script
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

Code Block
languagebash
titleRe-order the final BED fields
head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr

tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = $10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

...

Code Block
titleRe-ordered BED attributes
 1. chrom  2. start             1. chrom  2. start  3. end  4. ID  5. length  6. strand
 7. gene   8. orf_classification

**Whew**! That was a lot of work. Welcome to the world of annotation wrangling. Itwrangling – it's never pretty! But at least the result is much nicer looking. Examine the results using more or less or head:

...


Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

Expand
titleSetup (if needed)
Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes* .

Exercise: How many genes in our sc_genes.bed file are in each category?

...

Code Block
languagebash
titleStart an idev session
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBBintro_NGS

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block
languagebash
titleSetup for BEDTools exercises
# To catch up...
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeastsc_mrnagenes.sort.filt.bam* .bed* . 
cp $CORENGS/yeast_rna/*.gff .

# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sc_genes.bedyeast_mrna.sort.filt.bam* . 

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

...

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt
Code Block
titlesamtools flagstat output
3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n
Code Block
titledistribution of mapping qualities
    453 20
   6260 21
    889 22
    326 23
    971 24
   2698 25
    376 26
  12769 27
    268 28
    337 29
    933 30
   1229 31
    345 32
   5977 33
    256 34
    249 35
   1103 36
    887 37
    292 38
   4648 39
   5706 40
    426 41
   1946 42
   1547 43
   1761 44
   6138 45
   1751 46
   3019 47
   3710 48
   3236 49
   4467 50
  15691 51
  25370 52
  16636 53
  18081 54
   7084 55
   2701 56
  59851 57
   2836 58
   2118 59
3097901 60

To compute average mapping quality:

Code Block
languagebash
samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",sum/tot,"for",tot,"reads"}'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Code Block
languagebash
titleRun bedtools multicov to count BAM alignments overlapping a set of genes
module load bedtools
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed sc_genes.bed > yeast_mrna_gene_counts.bed

...

Expand
titleAnswer
Code Block
languagebash
cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
  'BEGIN{

Most of the genes (6235/6607) have non-zero read overlap counts.

...