...
Expand |
---|
|
Code Block |
---|
| cat yeast_mrna_gene_counts.bed | awk '
BEGIN{FS="\t";sum=0;tot=0}
{if($7 > 0) { sum = sum + $7; tot = tot + 1 }}
END{printf("%d overlapping reads in %d genes\n", sum, tot) }' |
There are 1,152,831 overlapping reads in 6,141 non-0 gene annotations. |
Use bedtools coverage to create a signal track
A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).
The bedtools coverage function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -d (per-base depth) option produces output that can be made into a bedGraph. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.
Make sure you're in an idev session, then prepare a directory for this exercise.
Code Block |
---|
language | bash |
---|
title | Prepare for bedtools coverage |
---|
|
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load bedtools
mkdir -p $SCRATCH/core_ngs/bedtools_coverage
cp $CORENGS/catchup/bedtools_merge/merged*bed $SCRATCH/core_ngs/bedtools_coverage/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_coverage/ |
Then calling bedtools coverage is easy. The "A" file will be our gene regions, and the "B" file will be the yeast RNAseq reads. We also use the -d (per-base depth) and -s (force "strandedness") options.
Code Block |
---|
|
cds; cd core_ngs/bedtools_coverage
bedtools coverage -s -d -a merged.good.sc_genes.bed -b yeast_mrna.sort.filt.bam > yeast_mrna.gene_coverage.txt
wc -l yeast_mrna.gene_coverage.txt # 8,829,317 lines! |
It will complain a bit because our genes file includes the yeast plasmid "2-micron" but the RNAseq BAM doesn't include that contig. We'll ignore that warning.
The bedtools coverage output is a bit strange. It lists each region in the A file, followed by information from the B reads. Here the column order will be gene_chrom gene_start gene_end gene_name gene_score gene_strand offset_in_the_gene_region read_overlap count.
Let's look at coverage for gene YAL067C:
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | grep -P 'YAL067C' | head -50 |
Will look like this:
Code Block |
---|
chrI 7234 9016 YAL067C 1 - 1 0
chrI 7234 9016 YAL067C 1 - 2 0
chrI 7234 9016 YAL067C 1 - 3 0
chrI 7234 9016 YAL067C 1 - 4 0
chrI 7234 9016 YAL067C 1 - 5 0
chrI 7234 9016 YAL067C 1 - 6 0
chrI 7234 9016 YAL067C 1 - 7 0
chrI 7234 9016 YAL067C 1 - 8 0
chrI 7234 9016 YAL067C 1 - 9 0
chrI 7234 9016 YAL067C 1 - 10 0
chrI 7234 9016 YAL067C 1 - 11 0
chrI 7234 9016 YAL067C 1 - 12 0
chrI 7234 9016 YAL067C 1 - 13 0
chrI 7234 9016 YAL067C 1 - 14 0
chrI 7234 9016 YAL067C 1 - 15 0
chrI 7234 9016 YAL067C 1 - 16 0
chrI 7234 9016 YAL067C 1 - 17 1
chrI 7234 9016 YAL067C 1 - 18 1
chrI 7234 9016 YAL067C 1 - 19 1
chrI 7234 9016 YAL067C 1 - 20 1
chrI 7234 9016 YAL067C 1 - 21 1
chrI 7234 9016 YAL067C 1 - 22 1
chrI 7234 9016 YAL067C 1 - 23 1
chrI 7234 9016 YAL067C 1 - 24 1
chrI 7234 9016 YAL067C 1 - 25 1
chrI 7234 9016 YAL067C 1 - 26 1
chrI 7234 9016 YAL067C 1 - 27 1
chrI 7234 9016 YAL067C 1 - 28 1
chrI 7234 9016 YAL067C 1 - 29 1
chrI 7234 9016 YAL067C 1 - 30 1
chrI 7234 9016 YAL067C 1 - 31 1
chrI 7234 9016 YAL067C 1 - 32 1
chrI 7234 9016 YAL067C 1 - 33 1
chrI 7234 9016 YAL067C 1 - 34 1
chrI 7234 9016 YAL067C 1 - 35 1
chrI 7234 9016 YAL067C 1 - 36 1
chrI 7234 9016 YAL067C 1 - 37 1
chrI 7234 9016 YAL067C 1 - 38 2
chrI 7234 9016 YAL067C 1 - 39 2
chrI 7234 9016 YAL067C 1 - 40 2
chrI 7234 9016 YAL067C 1 - 41 3
chrI 7234 9016 YAL067C 1 - 42 3
chrI 7234 9016 YAL067C 1 - 43 3
chrI 7234 9016 YAL067C 1 - 44 3
chrI 7234 9016 YAL067C 1 - 45 4
chrI 7234 9016 YAL067C 1 - 46 4
chrI 7234 9016 YAL067C 1 - 47 4
chrI 7234 9016 YAL067C 1 - 48 4
chrI 7234 9016 YAL067C 1 - 49 4
chrI 7234 9016 YAL067C 1 - 50 4 |
A proper bedGraph file has only 4 columns: chrom start end value and does not need to include positions with 0 reads, so we'll convert the bedtools coverage output to bedGraph using awk. We re-sort the output so that plus and minus strand positions are adjacent.
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{FS=OFS="\t"}
{if ($8>0) {print $1,$2-1+$7,$2+$7,$8}}' | \
sort -k1,1 -k2,2n -k3,3n > yeast_mrna.gene_coverage.almost.bedGraph
wc -l yeast_mrna.gene_coverage.almost.bedGraph # 5,710,186 -- better, but still big |
While we probably could consider this file to have bedGraph format, it's preferable to combine adjacent per-base coordinates with the same count into larger regions, e.g.
Code Block |
---|
# per-base counts
chrI 7271 7272 2
chrI 7272 7273 2
chrI 7273 7274 2
chrI 7274 7275 3
chrI 7275 7276 3
chrI 7276 7277 3
chrI 7277 7278 3
# corresponding region counts
chrI 7271 7274 6
chrI 7274 7278 12 |
Here's some awk to do this:
Code Block |
---|
|
cat yeast_mrna.gene_coverage.almost.bedGraph | awk '
BEGIN{FS=OFS="\t"; chr=""; start=-1; end=-1; count=0}
{if (chr != $1) { # new contig; finish previous
if (count > 0) { print chr,start,end,count }
chr=$1; start=$2; end=$3; count=$4
} else if (($2==end || $2==end+1) && ($4==count)) { # same or adjacent position with same count
end=$3;
} else { # new region on same contig; finish prev
if (count > 0) { print chr,start,end,count}
start=$2; end=$3; count=$4
}
}
END{ # finish last
if (count > 0) { print chr,start,end,count }
}' > yeast_mrna.gene_coverage.bedGraph
wc -l yeast_mrna.gene_coverage.bedGraph # 1,048,510 -- much better! |
Make sure the total counts match!
Code Block |
---|
|
cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{tot=0}{tot=tot+$8}END{print tot}' # should be 86703686
cat yeast_mrna.gene_coverage.almost.bed | awk '
BEGIN{tot=0}{tot=tot+$4}END{print tot}' # should also be 86703686
cat yeast_mrna.gene_coverage.bedGraph | awk '
BEGIN{tot=0}{tot=tot+$4*($3-$2)}END{print tot}' # should also be 86703686 |
Now our yeast_mrna.gene_coverage.bedGraph file is a proper bedGraph, whose first lines look like this:
Code Block |
---|
chrI 7250 7271 1
chrI 7271 7274 2
chrI 7274 7278 3
chrI 7278 7310 4
chrI 7310 7317 3
chrI 7317 7349 2
chrI 7349 7353 1
chrI 7500 7556 1
chrI 8851 8891 1
chrI 11919 11951 1 |