Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleclick to see the code and output
Code Block
languagebash
titlesolution code
module load bedtools #if you haven't loaded it in for this session
bedtools bamtobed -i yeast_pairedend_sort.mapped.q1.bam > yeast_pairedend_sort.mapped.q1.bed

more yeast_pairedend_sort.mapped.q1.bed #to examine the bed file visually
wc -l yeast_pairedend_sort.mapped.q1.bed #to get the number of lines in a file

Here is what my output looks like:

Code Block
titleoutput from the code above
wc -l yeast_pairedend_sort.mapped.q1.bed
528976 yeast_pairedend_sort.mapped.q1.bed
more yeast_pairedend_sort.mapped.q1.bed
chrI    219    320    HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659/1    37    +
chrI    266    344    HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862/2    29    +
chrI    368    469    HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862/1    29    -
chrI    684    785    HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659/2    37    -
chrI    871    955    HWI-ST1097:127:C0W5VACXX:5:1103:4918:43976/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:1104:2027:42518/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:1109:3153:38695/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2109:6222:11815/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2113:5002:59471/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2113:7803:87146/2    29    +
chrI    971    1072    HWI-ST1097:127:C0W5VACXX:5:1103:4918:43976/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:1104:2027:42518/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:1109:3153:38695/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2109:6222:11815/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2113:5002:59471/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2113:7803:87146/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2203:1231:50183/1    37    -

Note the "stacks" of reads that are occurring on similar coordinates on the same strand of the genome.  We'll deal with those in the bedtools merge section.

See also: bedtools bedtobam, if you need to get back to a bam file from a bed file (some programs take bam files as input).  Documentation here.

bedtools coverage: how much of the genome does my data cover?

...

Expand
titleclick here to see the code for bedtools coverage
Code Block
languagebash
titlesolution code
module load bedtools #again, if not already loaded
bedtools coverage -a yeast_pairedend_sort.mapped.q1.bed -b sacCer3.chrom.sizes.bed > sacCer3coverage.bed
more sacCer3coverage.bed #this file should have 17 lines, one for each chromosome

And here is what my output looks like:

Code Block
languagebash
more sacCer3coverage.bed
chrI      1    230218    7972     128701    230217    0.5590421
chrII     1    813184    35818    539222    813183    0.6631004
chrIII    1    316620    13701    199553    316619    0.6302623
chrIV     1    1531933   70633    1026387   1531932   0.6699951
chrIX     1    439888    15953    276571    439887    0.6287319
chrM      1    85779     3264     58599     85778     0.6831472
chrV      1    576874    26918    381078    576873    0.6605926
chrVI     1    270161    10662    167222    270160    0.6189740
chrVII    1    1090940   49762    722821    1090939   0.6625677
chrVIII   1    562643    23424    356421    562642    0.6334774
chrX      1    745751    30743    472357    745750    0.6333986
chrXI     1    666816    27950    446567    666815    0.6697015
chrXII    1    1078177   48155    658373    1078176   0.6106359
chrXIII   1    924431    40054    618798    924430    0.6693833
chrXIV    1    784333    32565    513382    784332    0.6545468
chrXV     1    1091291   47871    710376    1091290   0.6509507
chrXVI    1    948066    43531    612122    948065    0.6456540

It's worth noting that just computing coverage over the genome isn't the most useful thing, but you might compute coverage over a set of genes or regions of interest.  Coverage is really useful coupled with intersect or subtract as well.

bedtools merge: collapsing bookended elements (or elements within a certain distance)

When we originally examined the bed files produced from our BAM file, we can see many reads that overlap over the same interval.  While this level of detail is useful, for some analyses, we can collapse each read into a single line, and indicate how many reads occurred over that genomic interval.  We can accomplish this using bedtools merge.

Code Block
languagebash
bedtools merge [OPTIONS] -i experiment.bed > experiment.merge.bed

Bedtools merge also directs the output to standard out, to make sure to point the output to a file or a program.  While we haven't discussed the options for each bedtools function in detail, here they are very important.  Many of the options define what to do with each column (-c) of the output (-o).  This defines what type of operation to perform on each column, and in what order to output the columns.  Standard bed6 format is chrom, start, stop, name, score, strand and controlling column operations allows you to control what to put into each column of output.  The valid operations defined by the -o operation are as follows:

 

Expand
titlevalid operations using the -o option
  • sum, min, max, absmin, absmax,
  • mean, median,
  • collapse (i.e., print a delimited list (duplicates allowed)),
  • distinct (i.e., print a delimited list (NO duplicates allowed)),
  • count
  • count_distinct (i.e., a count of the unique values in the column)

For this exercise, we'll be summing the number of reads over a region to get a score column, using distinct to choose a name, and using distinct again to keep track of the strand.  For the -c options, define which columns to operate on, in the order you want the output.  In this case, to keep the standard bed format, we'll list as -c 5,4,6 and -o count_distinct,sum,distinct, to keep the proper order of name, score, strand.  Strandedness can also be controlled with merge, using the -s option. 

...

bedtools closest: when you want to know how far your regions are from a test set

...

The manual page for bedtools closest has a really nice image of how closest behaves with overlapping options.  Bedtools closest first looks for any overlaps of B with A, if it finds an overlap, the overlap in B with the highest proportional overlap with A is reported.  If there are no overlaps, then it looks for the closest genomic feature proximal to A (using distance from the start or end of A to do this).

Code Block
languagebash
titlebedtools intersect options
bedtools closest [OPTIONS] -a <FILE> \
                           -b <FILE1, FILE2, ..., FILEN>

Much like bedtools intersect, bedtools closest takes an A file and a series of B files.  So if you wanted to determine the distance of your regions of interest to several different classes of genes, bedtools closest would be a useful tool for that analysis. 

 

Expand
titlebedtools closest options
  • s:   Require same strandedness. That is, find the closest feature in B that overlaps A on the _same_ strand. By default, overlaps are reported without respect to strand.
  • S:   Require opposite strandedness. That is, find the closest featurein B that overlaps A on the _opposite_ strand. By default, overlaps are reported without respect to strand.
  • d:   In addition to the closest feature in B, report its distance to A as an extra column. The reported distance for overlapping features will be 0.
  • D:   Like -d, report the closest feature in B, and its distance to A as an extra column. However unlike -d, use negative distances to report upstream features.

    • ref Report distance with respect to the reference genome.  B features with a lower (start, stop) are upstream.
    • a Report distance with respect to A. When A is on the - strand, “upstream” means B has a higher (start,stop).
    • b Report distance with respect to B. When B is on the - strand, “upstream” means A has a higher (start,stop).
  • io: Ignore features in B that overlap A. That is, we want close, yet not touching features only.
  • iu: Ignore features in B that are upstream of features in A. This option requires -D and follows its orientation rules for determining what is “upstream”.
  • id: Ignore features in B that are downstream of features in A. This option requires -D and follows its orientation rules for determining what is “downstream”

  • names: When using multiple databases (-b), provide an alias for each that will appear instead of a file Id when also printing the DB record.

In this section, we'll intersect the human_rnaseq_bwa_sort.mapped.q1.merge.bed file with some protein coding gene from Gencode (hg19).

bedtools subtract:  removing features from your bed file

bedtools subtract:

Code Block
languagebash
titlebedtools intersect options
bedtools closest [OPTIONS] -a <FILE> \
                           -b <FILE1, FILE2, ..., FILEN>

g

Expand
titlebedtools closest options
  • s:   Require same strandedness. That is, find the closest feature in B that overlaps A on the _same_ strand. By default, overlaps are reported without respect to strand.
  • S:   Require opposite strandedness. That is, find the closest featurein B that overlaps A on the _opposite_ strand. By default, overlaps are reported without respect to strand.
  • d:   In addition to the closest feature in B, report its distance to A as an extra column. The reported distance for overlapping features will be 0.
  • D:   Like -d, report the closest feature in B, and its distance to A as an extra column. However unlike -d, use negative distances to report upstream features.

    • ref Report distance with respect to the reference genome.  B features with a lower (start, stop) are upstream.
    • a Report distance with respect to A. When A is on the - strand, “upstream” means B has a higher (start,stop).
    • b Report distance with respect to B. When B is on the - strand, “upstream” means A has a higher (start,stop).
  • io: Ignore features in B that overlap A. That is, we want close, yet not touching features only.
  • iu: Ignore features in B that are upstream of features in A. This option requires -D and follows its orientation rules for determining what is “upstream”.
  • id: Ignore features in B that are downstream of features in A. This option requires -D and follows its orientation rules for determining what is “downstream”

  • names: When using multiple databases (-b), provide an alias for each that will appear instead of a file Id when also printing the DB record.

A little bit of filtering, using awk

...