Page History

...

Expand

title	Answer

Code Block

language	bash

cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
  awk 'BEGIN{{if ($9 > 0) print $9}' | wc -l

Most of the genes (6235/6607) have non-zero read overlap counts.

...

Expand

title	Answer

Code Block

language	bash

cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"recordsgenes"}'

There are 1144990 overlapping reads in 6235 records

...

Expand

title	Answer

Code Block

language	bash

grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records non-Dubious genes"}'

There are 1093140 overlapping reads in 5578 recordsnon-Dubious genes

Use bedtools merge to collapse overlapping annotations

...

One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – regions – and that's what the the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).

...

chrom, start and end of the merged region only, if a stranded merge was not requested
the strand of the merged region in column 4 if a stranded was requested

...

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

Expand

title	Setup (if needed)

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes.bed* .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
module load bedtools

Code Block

language	bash
title	Use bedtools merge to collapse overlapping gene annotations

cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.sc_genes.txt

...

Expand

title	Answer

Output column 5 has the gene count.

Code Block

language	bash

cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

...

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
  -c 4,4 -o count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed, merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block

language	bash

cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

...

Expand

title	Answer

The output does not follow the BED6 specification: "chrom, start, end, name, score, strand"

The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

We can use awk to re-order the fields:

Code Block

language	bash

cat merged.good.sc_genes.txt | awk '
  BEGIN{FS=OFS="\t"}
  {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed

...

Page tree

Versions Compared

Old Version 52

New Version 53

Key

Use bedtools merge to collapse overlapping annotations