...
Expand |
---|
|
Code Block |
---|
| cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l
# or
cat yeast_mrna_gene_counts.bed | awk \
awk 'BEGIN{{if ($9 > 0) print $9}' | wc -l |
Most of the genes (6235/6607) have non-zero read overlap counts. |
...
Expand |
---|
|
Code Block |
---|
| cat yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{print sum,"overlapping reads in",tot,"recordsgenes"}' |
There are 1144990 overlapping reads in 6235 records |
...
Expand |
---|
|
Code Block |
---|
| grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
END{print sum,"overlapping reads in",tot,"records non-Dubious genes"}' |
There are 1093140 overlapping reads in 5578 recordsnon-Dubious genes |
Use bedtools merge to collapse overlapping annotations
...
One way to avoid this double-counting is to collapse the overlapping regions into a merged set of non-overlapping regions – regions – and that's what the the bedtools merge utility does (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html).
...
- chrom, start and end of the merged region only, if a stranded merge was not requested
- the strand of the merged region in column 4 if a stranded was requested
...
bedtools merge also requires that the input BED file be sorted by locus (chrom + start), so we do that first, then we request a strand-specific merge (-s):
Expand |
---|
|
Code Block |
---|
| mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes.bed* .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
module load bedtools |
|
Code Block |
---|
language | bash |
---|
title | Use bedtools merge to collapse overlapping gene annotations |
---|
|
cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 4,4 -o count,collapse > merged.sc_genes.txt |
...
Expand |
---|
|
Output column 5 has the gene count. Code Block |
---|
| cut -f 5 merged.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: Code Block |
---|
| 6374 1
105 2
4 3
1 4
1 7 |
There are 111 regions (105 + 4 + 1 + 1) where more than one gene contributed. |
...
Expand |
---|
|
Code Block |
---|
| cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed
sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
-c 4,4 -o count,collapse > merged.good.sc_genes.txt
wc -l good.sc_genes.bed, merged.good.sc_genes.txt |
There were 5797 "good" (non-Dubious) genes before merging and 5770 after. Code Block |
---|
| cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n |
Produces this histogram: Code Block |
---|
| 5750 1
18 2
1 4
1 7 |
Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped. |
...
Expand |
---|
|
The output does not follow the BED6 specification: "chrom, start, end, name, score, strand" The first 3 output columns comply with the BED3 standard (chrom, start, end), but if strand is to be included, it should be in column 6. Column 4 should be name (we'll put the collapsed gene name list there), and column 5 a score (we'll put the region count there). We can use awk to re-order the fields: Code Block |
---|
| cat merged.good.sc_genes.txt | awk '
BEGIN{FS=OFS="\t"}
{print $1,$2,$3,$6,$5,$4}' > merged.good.sc_genes.bed |
|
...