Page History

...

Code Block

language	bash
title	Look at GFF annotation entries with less

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

# Use the less pager to look at multiple lines
less sacCer_R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer_R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)

Expand

title	Setup (if needed)

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

Code Block

language	bash
title	Create a histogram of all the feature types in a GFF

cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more

...

Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).

Expand

title	Setup (if needed)

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .

Code Block

language	bash
title	Convert GFF to BED with BioITeam script

/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

Code Block

language	bash
title	Re-order the final BED fields

head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr

tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = $10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

...

Code Block

title	Re-ordered BED attributes

 1. chrom  2. start             1. chrom  2. start  3. end  4. ID  5. length  6. strand
 7. gene   8. orf_classification

**Whew**! That was a lot of work. Welcome to the world of annotation wrangling. Itwrangling – it's never pretty! But at least the result is much nicer looking. Examine the results using more or less or head:

...

Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

Expand

title	Setup (if needed)

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes* .

Exercise: How many genes in our sc_genes.bed file are in each category?

...

Code Block

language	bash
title	Start an idev session

idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBBintro_NGS

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Code Block

language	bash
title	Setup for BEDTools exercises

# To catch up...
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeastsc_mrnagenes.sort.filt.bam* .bed* . 
cp $CORENGS/yeast_rna/*.gff .

# Copy the BAM file
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sc_genes.bedyeast_mrna.sort.filt.bam* .

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

...

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt

Code Block

title	samtools flagstat output

3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n

Code Block

title	distribution of mapping qualities

To compute average mapping quality:

Code Block

language	bash

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",sum/tot,"for",tot,"reads"}'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

...

Code Block

language	bash
title	Run bedtools multicov to count BAM alignments overlapping a set of genes

module load bedtools
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed sc_genes.bed > yeast_mrna_gene_counts.bed

...

Expand

title	Answer

Code Block

language	bash

cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
  'BEGIN{

Most of the genes (6235/6607) have non-zero read overlap counts.

...

Page tree

Versions Compared

Old Version 50

New Version 51

Key