Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Use bedtools coverage to create a signal track

A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

The bedtools coverage function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -d (per-base depth) option produces output that can be made into a bedGraph. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.

Make sure you're in an idev session, then prepare a directory for this exercise.

Code Block
languagebash
titlePrepare for bedtools coverage
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_coverage
cp $CORENGS/catchup/bedtools_merge/merged*bed      $SCRATCH/core_ngs/bedtools_coverage/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_coverage/

Then calling bedtools coverage is easy. The "A" file will be our gene regions, and the "B" file will be the yeast RNAseq reads. We also use the -d (per-base depth) and -s (force "strandedness") options.

Code Block
languagebash
cds; cd core_ngs/bedtools_coverage
bedtools coverage -s -d -a merged.good.sc_genes.bed -b yeast_mrna.sort.filt.bam > yeast_mrna.gene_coverage.txt

wc -l yeast_mrna.gene_coverage.txt # 8,829,317 lines!

It will complain a bit because our genes file includes the yeast plasmid "2-micron" but the RNAseq BAM doesn't include that contig. We'll ignore that warning.

The bedtools coverage output is a bit strange. It lists each region in the A file, followed by information from the B reads. Here the column order will be gene_chrom gene_start gene_end gene_name gene_score gene_strand offset_in_the_gene_region read_overlap count.

Let's look at coverage for gene YAL067C:

Code Block
languagebash
cat yeast_mrna.gene_coverage.txt | grep -P 'YAL067C' | head -50

Will look like this:

Code Block
chrI    7234    9016    YAL067C 1       -       1       0
chrI    7234    9016    YAL067C 1       -       2       0
chrI    7234    9016    YAL067C 1       -       3       0
chrI    7234    9016    YAL067C 1       -       4       0
chrI    7234    9016    YAL067C 1       -       5       0
chrI    7234    9016    YAL067C 1       -       6       0
chrI    7234    9016    YAL067C 1       -       7       0
chrI    7234    9016    YAL067C 1       -       8       0
chrI    7234    9016    YAL067C 1       -       9       0
chrI    7234    9016    YAL067C 1       -       10      0
chrI    7234    9016    YAL067C 1       -       11      0
chrI    7234    9016    YAL067C 1       -       12      0
chrI    7234    9016    YAL067C 1       -       13      0
chrI    7234    9016    YAL067C 1       -       14      0
chrI    7234    9016    YAL067C 1       -       15      0
chrI    7234    9016    YAL067C 1       -       16      0
chrI    7234    9016    YAL067C 1       -       17      1
chrI    7234    9016    YAL067C 1       -       18      1
chrI    7234    9016    YAL067C 1       -       19      1
chrI    7234    9016    YAL067C 1       -       20      1
chrI    7234    9016    YAL067C 1       -       21      1
chrI    7234    9016    YAL067C 1       -       22      1
chrI    7234    9016    YAL067C 1       -       23      1
chrI    7234    9016    YAL067C 1       -       24      1
chrI    7234    9016    YAL067C 1       -       25      1
chrI    7234    9016    YAL067C 1       -       26      1
chrI    7234    9016    YAL067C 1       -       27      1
chrI    7234    9016    YAL067C 1       -       28      1
chrI    7234    9016    YAL067C 1       -       29      1
chrI    7234    9016    YAL067C 1       -       30      1
chrI    7234    9016    YAL067C 1       -       31      1
chrI    7234    9016    YAL067C 1       -       32      1
chrI    7234    9016    YAL067C 1       -       33      1
chrI    7234    9016    YAL067C 1       -       34      1
chrI    7234    9016    YAL067C 1       -       35      1
chrI    7234    9016    YAL067C 1       -       36      1
chrI    7234    9016    YAL067C 1       -       37      1
chrI    7234    9016    YAL067C 1       -       38      2
chrI    7234    9016    YAL067C 1       -       39      2
chrI    7234    9016    YAL067C 1       -       40      2
chrI    7234    9016    YAL067C 1       -       41      3
chrI    7234    9016    YAL067C 1       -       42      3
chrI    7234    9016    YAL067C 1       -       43      3
chrI    7234    9016    YAL067C 1       -       44      3
chrI    7234    9016    YAL067C 1       -       45      4
chrI    7234    9016    YAL067C 1       -       46      4
chrI    7234    9016    YAL067C 1       -       47      4
chrI    7234    9016    YAL067C 1       -       48      4
chrI    7234    9016    YAL067C 1       -       49      4
chrI    7234    9016    YAL067C 1       -       50      4

A proper bedGraph file has only 4 columns: chrom start end value and does not need to include positions with 0 reads, so we'll convert the bedtools coverage output to bedGraph using awk. We re-sort the output so that plus and minus strand positions are adjacent.

Code Block
languagebash
cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{FS=OFS="\t"}
{if ($8>0) {print $1,$2-1+$7,$2+$7,$8}}' | \
  sort -k1,1 -k2,2n -k3,3n > yeast_mrna.gene_coverage.almost.bedGraph

wc -l yeast_mrna.gene_coverage.almost.bedGraph  # 5,710,186 -- better, but still big

While we probably could consider this file to have bedGraph format, it's preferable to combine adjacent per-base coordinates with the same count into larger regions, e.g.

Code Block
# per-base counts
chrI    7271    7272    2
chrI    7272    7273    2
chrI    7273    7274    2
chrI    7274    7275    3
chrI    7275    7276    3
chrI    7276    7277    3
chrI    7277    7278    3

# corresponding region counts
chrI    7271    7274    6
chrI    7274    7278    12

Here's some awk to do this:

Code Block
languagebash
cat yeast_mrna.gene_coverage.almost.bedGraph | awk '
BEGIN{FS=OFS="\t"; chr=""; start=-1; end=-1; count=0}
{if (chr != $1) { # new contig; finish previous
   if (count > 0) { print chr,start,end,count }
   chr=$1; start=$2; end=$3; count=$4
 } else if (($2==end || $2==end+1) && ($4==count)) { # same or adjacent position with same count
   end=$3;
 } else { # new region on same contig; finish prev
   if (count > 0) { print chr,start,end,count}
   start=$2; end=$3; count=$4
 }
}
END{ # finish last
  if (count > 0) { print chr,start,end,count }
}' > yeast_mrna.gene_coverage.bedGraph

wc -l yeast_mrna.gene_coverage.bedGraph  # 1,048,510 -- much better!

Make sure the total counts match!

Code Block
languagebash
cat yeast_mrna.gene_coverage.txt | awk '
  BEGIN{tot=0}{tot=tot+$8}END{print tot}'          # should be 86703686 
cat yeast_mrna.gene_coverage.almost.bed | awk '
  BEGIN{tot=0}{tot=tot+$4}END{print tot}'          # should also be 86703686 
cat yeast_mrna.gene_coverage.bedGraph | awk '
  BEGIN{tot=0}{tot=tot+$4*($3-$2)}END{print tot}'  # should also be 86703686

Now our yeast_mrna.gene_coverage.bedGraph file is a proper bedGraph, whose first lines look like this:

Code Block
chrI    7250    7271    1
chrI    7271    7274    2
chrI    7274    7278    3
chrI    7278    7310    4
chrI    7310    7317    3
chrI    7317    7349    2
chrI    7349    7353    1
chrI    7500    7556    1
chrI    8851    8891    1
chrI    11919   11951   1

x

A brief introduction to bedtools

...

Expand
titleclick here to see the code and output


Code Block
languagebash
titlesolution code
module load bedtools
bedtools bamtofastq -i yeast_pairedend_sort.mapped.q1.bam -fq yeast_pairedend_sort.mapped.q1.fastq #takes 1-2 minutes
more yeast_pairedend_sort.mapped.q1.fastq

Here is an example of two sequences (and their corresponding quality scores):

 


Code Block
languagebash
titletwo lines of a fastq file
@HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659
TACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
+
CCCFFFFFHHHHHJJJJJIIJJJJIJJIJJIJJIIIIJJJIJJIJJIJJIJJJJJJJJJJIIGGIGEGAEHFEFFEFFFDEEE@CCEDCDDD>ACBBDCA@
@HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862
TAGGGTAAGTTTGAGATGGTATATACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAA
+
?B@DF2ADHFHHFHJIIIGCIHIGGIJJJJGHIIIGIJEHHIGGGAHEGGFGHIECGIJIIIJIIIIIJJJJJJE>EHDHEEEBCDD?CDDBDDDDDDCDB

As we discussed earlier, the top line is the identifier for the sequence produced, the second line defines which bases were produced, the third line indicates the strand the sequence is aligned to, and the fourth line indicates the ASCII based quality scores for each character in the second line.

...

When we originally examined the bed files produced from our BAM file, we can see many reads that overlap over the same interval.  While this level of detail is useful, for some analyses, we can collapse each read into a single line, and indicate how many reads occurred over that genomic interval.  We can accomplish this using bedtools merge.

Code Block
languagebash
bedtools merge [OPTIONS] -i experiment.bed > experiment.merge.bed

Bedtools merge also directs the output to standard out, to make sure to point the output to a file or a program.  While we haven't discussed the options for each bedtools function in detail, here they are very important.  Many of the options define what to do with each column (-c) of the output (-o).  This defines what type of operation to perform on each column, and in what order to output the columns.  Standard bed6 format is chrom, start, stop, name, score, strand and controlling column operations allows you to control what to put into each column of output.  The valid operations defined by the -o operation are as follows:

 


Expand
titlevalid operations using the -o option
  • sum, min, max, absmin, absmax,
  • mean, median,
  • collapse (i.e., print a delimited list (duplicates allowed)),
  • distinct (i.e., print a delimited list (NO duplicates allowed)),
  • count
  • count_distinct (i.e., a count of the unique values in the column)

For this exercise, we'll be summing the number of reads over a region to get a score column, using distinct to choose a name, and using distinct again to keep track of the strand.  For the -c options, define which columns to operate on, in the order you want the output.  In this case, to keep the standard bed format, we'll list as -c 5,4,6 and -o count_distinct,sum,distinct, to keep the proper order of name, score, strand.  Strandedness can also be controlled with merge, using the -s option. 

...