Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleAnswer
Code Block
languagebash
cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
  awk 'BEGIN{{if ($9 > 0) print $9}' | wc -l

Most of the genes (6235/6607) have non-zero read overlap counts.

...

Expand
titleAnswer
Code Block
languagebash
cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"recordsgenes"}'

There are 1144990 overlapping reads in 6235 records

...

Expand
titleAnswer
Code Block
languagebash
grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records non-Dubious genes"}'

There are 1093140 overlapping reads in 5578 recordsnon-Dubious genes

Use bedtools merge to collapse overlapping annotations

...

One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – regions – and that's what the the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).

...

  • chrom, start and end of the merged region only, if a stranded merge was not requested
  • the strand of the merged region in column 4 if a stranded was requested

...

bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):

Expand
titleSetup (if needed)
Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes.bed* .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
module load bedtools
Code Block
languagebash
titleUse bedtools merge to collapse overlapping gene annotations
cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.sc_genes.txt

...

Expand
titleAnswer

Output column 5 has the gene count.

Code Block
languagebash
cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   6374 1
    105 2
      4 3
      1 4
      1 7

There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed.

...

Expand
titleAnswer
Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
  -c 4,4 -o count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed, merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block
languagebash
cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block
languagebash
   5750 1
     18 2
      1 4
      1 7

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

...

Expand
titleAnswer

The output does not follow the BED6 specification: "chrom, start, end, name, score, strand"

The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there).

We can use awk to re-order the fields:

Code Block
languagebash
cat merged.good.sc_genes.txt | awk '
  BEGIN{FS=OFS="\t"}
  {print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed

...