Bedtools: Analyzing your aligned experiment

A brief introduction to bedtools

Now that we have a BAM file with only the reads we want included, we can do some more sophisticated analysis using bedtools. Bedtools changes from version to version, and here we are using version 2.22, the newest version, and what is currently installed on stampede. You can check what versions of bedtools are installed by using the following command on stampede:

module spider bedtools

First, log on to the login8 node on stampede and make a directory in scratch called bedtools in your scratch folder. Then copy your filtered BAM file from the samtools section into this folder.

ssh user@login8.stampede.tacc.utexas.edu #if you are not already logged in!
cds
mkdir bedtools
cd samtools
cp yeast_pairedend_sort.mapped.q1.bam ../bedtools
cd ../bedtools

If you were unable to make the filtered and sorted BAM file from the previous section, you can copy it over from my scratch directory:

Click here to expand...

cd bedtools
cp /scratch/01786/awh394/core_ngs/day4_2015/bedtools/yeast_pairedend_sort.mapped.q1.bam .

bedtools bamtofastq: converting a BAM file to a fastq file

Sometimes, especially when working with external data, we need to go from a BAM file back to a fastq file. This can be useful for re-aligning reads using a different aligner, different settings on the original aligner used. It can also be useful for extracting the sequence of interesting regions of the genome after you have manipulated your BAM file.

For this exercise, you'll be using bamtofastq. This function takes an aligned BAM file as input and outputs a fastq format file. You can use the options if you have paired end data to output R1 and R2 reads for your fastq file. This type of function is especially useful if you need to to analyze sequences after you've compared several BAM or bed files.

bedtools bamtofastq -i input.bam -fq output.fastq

Exercise 1: convert BAM to fastq and look at the quality scores

click here to see the code and output

solution code

module load bedtools
bedtools bamtofastq -i yeast_pairedend_sort.mapped.q1.bam -fq yeast_pairedend_sort.mapped.q1.fastq #takes 1-2 minutes
more yeast_pairedend_sort.mapped.q1.fastq

Here is an example of two sequences (and their corresponding quality scores):

two lines of a fastq file

@HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659
TACCCTCCAATTACCCATATCCAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
+
CCCFFFFFHHHHHJJJJJIIJJJJIJJIJJIJJIIIIJJJIJJIJJIJJIJJJJJJJJJJIIGGIGEGAEHFEFFEFFFDEEE@CCEDCDDD>ACBBDCA@
@HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862
TAGGGTAAGTTTGAGATGGTATATACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAA
+
?B@DF2ADHFHHFHJIIIGCIHIGGIJJJJGHIIIGIJEHHIGGGAHEGGFGHIECGIJIIIJIIIIIJJJJJJE>EHDHEEEBCDD?CDDBDDDDDDCDB

As we discussed earlier, the top line is the identifier for the sequence produced, the second line defines which bases were produced, the third line indicates the strand the sequence is aligned to, and the fourth line indicates the ASCII based quality scores for each character in the second line.

Bed file format: converting a BAM file into a bed file

While it's useful to be able to look at the fastq file, many analyses will be easiest to perform in bed format. Bed format is a simple tab delimited format that designates various properties about segments of the genome, defined by the chromosome, start coordinates and end coordinates. Bedtools provides a simple utility to convert BAM files over into bed files, termed bamtobed.

bedtools bamtobed -i input.bam > output.bed #output to a file
bedtools bamtobed -i input.bam | more #output to more

Note that the output will be piped to standard out unless you redirect to a program (head, more, less) or a file (output.bed). Now we'll put this example to use and convert our filtered BAM file from the samtools section into a bed file.

Exercise 2: Convert the filtered yeast paired end BAM to bed using bamtobed, look at your file in more, and find the number of lines in the file

Hint: direct the output to a file first, then use more to look at the converted file visually; use ctrl+c to quit more.

click to see the code and output

solution code

module load bedtools #if you haven't loaded it in for this session
bedtools bamtobed -i yeast_pairedend_sort.mapped.q1.bam > yeast_pairedend_sort.mapped.q1.bed

more yeast_pairedend_sort.mapped.q1.bed #to examine the bed file visually
wc -l yeast_pairedend_sort.mapped.q1.bed #to get the number of lines in a file

Here is what my output looks like:

output from the code above

wc -l yeast_pairedend_sort.mapped.q1.bed
528976 yeast_pairedend_sort.mapped.q1.bed
more yeast_pairedend_sort.mapped.q1.bed
chrI    219    320    HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659/1    37    +
chrI    266    344    HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862/2    29    +
chrI    368    469    HWI-ST1097:127:C0W5VACXX:5:2115:19940:13862/1    29    -
chrI    684    785    HWI-ST1097:127:C0W5VACXX:5:2212:10568:79659/2    37    -
chrI    871    955    HWI-ST1097:127:C0W5VACXX:5:1103:4918:43976/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:1104:2027:42518/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:1109:3153:38695/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2109:6222:11815/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2113:5002:59471/2    29    +
chrI    871    948    HWI-ST1097:127:C0W5VACXX:5:2113:7803:87146/2    29    +
chrI    971    1072    HWI-ST1097:127:C0W5VACXX:5:1103:4918:43976/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:1104:2027:42518/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:1109:3153:38695/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2109:6222:11815/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2113:5002:59471/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2113:7803:87146/1    29    -
chrI    978    1079    HWI-ST1097:127:C0W5VACXX:5:2203:1231:50183/1    37    -

Note the "stacks" of reads that are occurring on similar coordinates on the same strand of the genome. We'll deal with those in the bedtools merge section.

bedtools coverage: how much of the genome does my data cover?

One way of characterizing data is to understand what percentage of the genome your data covers. What type of experiment you performed should affect the coverage of your data. A ChIP-seq experiment will cover binding sites, and a RNA-seq experiment will cover expressed transcripts. Bedtools coverage allows you to compare one bed file to another and compute the breadth and depth of coverage.

bedtools coverage -a experiment.bed -b reference_file.bed

The resulting output will contain several additional columns which summarize this information:

bedtools coverage output

After each interval in B, coverageBed will report:

The number of features in A that overlapped (by at least one base pair) the B interval.
The number of bases in B that had non-zero coverage from features in A.
The length of the entry in B.
The fraction of bases in B that had non-zero coverage from features in A.

For this exercise, we'll use a bed file that summarizes the S. cerevisiae genome, version 3 (aka sacCer3). For this class, I've made a bed file out of the genome, using the file sacCer3.chrom.sizes. First go and copy the file over from my scratch directory:

cd bedtools #if you aren't already there
cp /scratch/01786/awh394/core_ngs/day4_2015/sacCer3.chrom.sizes.bed .

Use more to take a quick look at the file...

more sacCer3.chrom.sizes.bed
chrIV     1    1531933
chrXV     1    1091291
chrVII    1    1090940
chrXII    1    1078177
chrXVI    1    948066
chrXIII   1    924431
chrII     1    813184
chrXIV    1    784333
chrX      1    745751
chrXI     1    666816
chrV      1    576874
chrVIII   1    562643
chrIX     1    439888
chrIII    1    316620
chrVI     1    270161
chrI      1    230218
chrM      1    85779

The format is bed3 - just chrom, start (which is always 1) and stop, which is always the length of the chromosome, for this type of bed file.

Now use bedtools coverage to find the coverage of the file output.bed over the sacCer3 genome and examine the output coverage.

Exercise 3: Find the coverage of your bed file over the sacCer3 genome

click here to see the code for bedtools coverage

solution code

module load bedtools #again, if not already loaded
bedtools coverage -a yeast_pairedend_sort.mapped.q1.bed -b sacCer3.chrom.sizes.bed > sacCer3coverage.bed
more sacCer3coverage.bed #this file should have 17 lines, one for each chromosome

And here is what my output looks like:

more sacCer3coverage.bed
chrI      1    230218    7972     128701    230217    0.5590421
chrII     1    813184    35818    539222    813183    0.6631004
chrIII    1    316620    13701    199553    316619    0.6302623
chrIV     1    1531933   70633    1026387   1531932   0.6699951
chrIX     1    439888    15953    276571    439887    0.6287319
chrM      1    85779     3264     58599     85778     0.6831472
chrV      1    576874    26918    381078    576873    0.6605926
chrVI     1    270161    10662    167222    270160    0.6189740
chrVII    1    1090940   49762    722821    1090939   0.6625677
chrVIII   1    562643    23424    356421    562642    0.6334774
chrX      1    745751    30743    472357    745750    0.6333986
chrXI     1    666816    27950    446567    666815    0.6697015
chrXII    1    1078177   48155    658373    1078176   0.6106359
chrXIII   1    924431    40054    618798    924430    0.6693833
chrXIV    1    784333    32565    513382    784332    0.6545468
chrXV     1    1091291   47871    710376    1091290   0.6509507
chrXVI    1    948066    43531    612122    948065    0.6456540

bedtools merge: collapsing bookended elements (or elements within a certain distance)

When we originally examined the bed files produced from our BAM file, we can see many reads that overlap over the same interval. While this level of detail is useful, for some analyses, we can collapse each read into a single line, and indicate how many reads occurred over that genomic interval. We can accomplish this using bedtools merge.

bedtools merge [OPTIONS] -i experiment.bed > experiment.merge.bed

Bedtools merge also directs the output to standard out, to make sure to point the output to a file or a program. While we haven't discussed the options for each bedtools function in detail, here they are very important. Many of the options define what to do with each column (-c) of the output (-o). This defines what type of operation to perform on each column, and in what order to output the columns. Standard bed6 format is chrom, start, stop, name, score, strand and controlling column operations allows you to control what to put into each column of output. The valid operations defined by the -o operation are as follows:

valid operations using the -o option

sum, min, max, absmin, absmax,
mean, median,
collapse (i.e., print a delimited list (duplicates allowed)),
distinct (i.e., print a delimited list (NO duplicates allowed)),
count
count_distinct (i.e., a count of the unique values in the column)

For this exercise, we'll be summing the number of reads over a region to get a score column, using distinct to choose a name, and using distinct again to keep track of the strand. For the -c options, define which columns to operate on, in the order you want the output. In this case, to keep the standard bed format, we'll list as -c 5,4,6 and -o count_distinct,sum,distinct, to keep the proper order of name, score, strand. Strandedness can also be controlled with merge, using the -s option.

Exercise 4: Use bedtools merge to merge an experiment, look at the output and see how many lines there are in the file.

Hint: make sure to remove whitespace between lists for the -c and -o options!

click here to see the bedtools merge code and output

solution code

bedtools merge -s -c 4,5,6 -o count_distinct,sum,distinct -i yeast_pairedend_sort.mapped.q1.bed > yeast_pairedend_sort.mapped.q1.merge.bed
more yeast_pairedend_sort.mapped.q1.merge.bed
wc -l yeast_pairedend_sort.mapped.q1.merge.bed

wc -l yeast_pairedend_sort.mapped.q1.merge.bed
40319 yeast_pairedend_sort.mapped.q1.merge.bed #without the -s option
76601 yeast_pairedend_sort.mapped.q1.merge.bed #with the -s option

more yeast_pairedend_sort.mapped.q1.merge.bed 
chrI    219     344     2    66     +
chrI    368     469     1    29     -
chrI    684     785     1    37     -
chrI    871     955     6    174    +
chrI    971     1079    7    211    -
chrI    1216    1322    6    157    +
chrI    1347    1437    6    157    -
chrI    2892    2993    14   406    +
chrI    3010    3111    1    37     +
chrI    3013    3107    14   406    -

bedtools intersect: identifying where two experiments overlap (or don't overlap)

One useful way to compare two experiments (especially biological replicates, or similar experiments in two yeast strains/cell lines/mouse strains) is to compare where reads in one experiment overlap with reads in another experiment. Bedtools offers a simple way to do this using the intersect function.

bedtools intersect options

bedtools intersect [OPTIONS] -a <FILE> \
							 -b <FILE1, FILE2, ..., FILEN>

The intersect function has many options that control how to report the intersection. We'll be focusing on just a few of these options, listed below.

-a and -b indicate what files to intersect. in -b, you can specify one, or several files to intersect with the file specified in -a.

bedtools intersect options

wa: Write the original entry in A for each overlap.
wb: Write the original entry in B for each overlap. Useful for knowing what A overlaps. Restricted by -f and -r.
loj: Perform a “left outer join”. That is, for each feature in A report each overlap with B. If no overlaps are found, report a NULL feature for B.
wo: Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by -f and -r.
wao: Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by -f and -r.
f: Minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp).
v: Only report those entries in A that have no overlap in B. Restricted by -f and -r. Useful to report what doesn't overlap, the inverse of typical usage.
names: When using multiple databases (-b), provide an alias for each that will appear instead of a file Id when also printing the DB record.

In this section, we'll intersect two human experiments - one from sequencing RNA, and one from sequencing micro RNA. Copy these files over to your directory:

copy some files over to intersect

cds
mkdir intersect
cd intersect
cp /scratch/02423/nsabell/core_ngs/alignment/bam/human_rnaseq_bwa.bam .
cp /scratch/02423/nsabell/core_ngs/alignment/bam/human_mirnaseq_hg19.bam .

click here to see my output on copying files

-rwxrwxr-x 1 awh394 G-801021  19M May 27 14:17 human_mirnaseq_hg19.bam
-rwxrwxr-x 1 awh394 G-801021 6.6M May 27 14:17 human_rnaseq_bwa.bam

Before we can intersect these files, we need to perform the pipeline we used in samtools to index, sort and filter the files, and bedtools to convert from BAM over to bed, then collapse down the files using merge. Below is a little workflow to help you through it on the files you just copied above.

Click here to see the samtools/bedtools workflow

My output (for length of bed files) is in the comments.

a samtools/bedtools workflow

module load samtools #if you haven't loaded it up this session

#sort both files
samtools sort human_mirnaseq_hg19.bam human_mirnaseq_hg19_sort # will take 1-2 minutes
samtools sort human_rnaseq_bwa.bam human_rnaseq_bwa_sort # will take 1-2 minutes

#index the new files
samtools index human_mirnaseq_hg19_sort.bam
samtools index human_rnaseq_bwa_sort.bam

#filter the sorted files, reindex the new filtered files
samtools view -b -F 0x04 -q 1 -o human_mirnaseq_hg19_sort.mapped.q1.bam human_mirnaseq_hg19_sort.bam
samtools view -b -F 0x04 -q 1 -o human_rnaseq_bwa_sort.mapped.q1.bam human_rnaseq_bwa_sort.bam
samtools index human_mirnaseq_hg19_sort.mapped.q1.bam
samtools index human_rnaseq_bwa_sort.mapped.q1.bam

#convert filtered bam files to bed format
module load bedtools #if you haven't loaded it in for this session

bedtools bamtobed -i human_mirnaseq_hg19_sort.mapped.q1.bam > human_mirnaseq_hg19_sort.mapped.q1.bed
bedtools bamtobed -i human_rnaseq_bwa_sort.mapped.q1.bam > human_rnaseq_bwa_sort.mapped.q1.bed

#check the length of the files:
wc -l *.bed
#164806 human_mirnaseq_hg19_sort.mapped.q1.bed
# 22538 human_rnaseq_bwa_sort.mapped.q1.bed

#merge the bed files, check the length again
bedtools merge -s -c 4,5,6 -o count_distinct,sum,distinct -i human_mirnaseq_hg19_sort.mapped.q1.bed > human_mirnaseq_hg19_sort.mapped.q1.merge.bed
bedtools merge -s -c 4,5,6 -o count_distinct,sum,distinct -i human_rnaseq_bwa_sort.mapped.q1.bed > human_rnaseq_bwa_sort.mapped.q1.merge.bed

wc -l *.merge.bed 
#14794 human_mirnaseq_hg19_sort.mapped.q1.merge.bed
# 7134 human_rnaseq_bwa_sort.mapped.q1.merge.bed

If we run low on time, you can copy the merged bed files over from my directory on scratch:

click to copy over the merged bed files...

cds
cd intersect
cp /scratch/01786/awh394/core_ngs/day4_2015/intersect/human_mirnaseq_hg19_sort.mapped.q1.merge.bed .
cp /scratch/01786/awh394/core_ngs/day4_2015/intersect/human_rnaseq_bwa_sort.mapped.q1.merge.bed .

Exercise 5: Intersect two experiments using intersect and examine the output

click here for the bedtools intersect code and output

My output is commented in this code block.

cd intersect 
module load bedtools #if you haven't loaded it up yet this session
bedtools intersect -wo -a human_rnaseq_bwa_sort.mapped.q1.merge.bed -b human_mirnaseq_hg19_sort.mapped.q1.merge.bed > hg19_rnaseq_mirnaseq_intersect.bed 

wc -l hg19_rnaseq_mirnaseq_intersect.bed
#38 hg19_rnaseq_mirnaseq_intersect.bed

more hg19_rnaseq_mirnaseq_intersect.bed
#chr1    20987370    20987471    1    37    -    chr1    20987402    20987430    1    12    -    28
#chr1    25555557    25555616    1    37    +    chr1    25555612    25555636    1    2    -    4
#chr1    25555557    25555617    1    37    -    chr1    25555612    25555636    1    2    -    5
#chr1    28906396    28906497    1    37    +    chr1    28906368    28906405    6    246    -    9
#chr1    33245783    33245884    1    37    +    chr1    33245880    33245908    1    24    -    4

Using the options we've specified (-wo) the resulting file will have entries for file A, file B and the number of base pairs overlap between the feature in A and the features in B, but we'll only retain lines where there is an overlap between A and B. We could also use the -v option to only contain areas with NO intersection, or control the intersections with -f and -r options. Bedtools intersect is a powerful tool, and it's always a good idea to ask "what is this code going to do?" while you're testing analysis workflows.

A little bit of filtering, using awk

As a final note, yesterday Nathan taught you about using a lot of unix utilities, including uniq, sort and cut. One last utility I'd like to add, that is very useful for manipulating these types of tab delimited files, is awk. Awk isn't a command, but rather a little text manipulation language in it's own right. While awk can be used to do many different things, here we'll primarily use it to sort tab delimited files based on the values present in those files. That is useful to filter your files for entries on a given chromosome, or greater than/less than a given score. If your dataset is large, this type of filtering can be invaluable! Below is an example of a simple awk script:

a simple awk script

cat file.bed | awk 'BEGIN{FS="\t";OFS="\t";}{if ($6 == '+'){print}}' > file.plusStrand.bed

In the first section, we open the bed file of interest. Then we pipe that filestream to the awk program.
The section: BEGIN{FS="\t";OFS="\t";} tells awk to begin a filter, the input file is tab delimited, and the output file is also tab delimited.
1. Generally, you can leave this section constant (if you are working with tab delimited files).
The second section: {if ($6 == '+'){print}} is our selection and printing criteria.
1. "$6" indicates column 6, and == indicates "equals" or "matches".
The {print} command tells awk to print the whole line if the statement for column 6 evaluates to true.
Thus, the output file only contains those lines which satisfy the criteria in the selection statement.

We can do this filtering on the hg19_rnaseq_mirnaseq_intersect.bed file we just created using bedtools intersect.

cat hg19_rnaseq_mirnaseq_intersect.bed | awk 'BEGIN{FS="\t";OFS="\t";}{if ($6 == "+"){print}}' | more

You could also insist on columns 6 and 12 both being the plus strand as such:

cat hg19_rnaseq_mirnaseq_intersect.bed | awk 'BEGIN{FS="\t";OFS="\t";}{if ($6 == "+" && $12 == "+"){print}}' | more

Page tree

Bedtools: Analyzing your aligned experiment

A brief introduction to bedtools

bedtools bamtofastq: converting a BAM file to a fastq file

Bed file format: converting a BAM file into a bed file

bedtools coverage: how much of the genome does my data cover?

bedtools merge: collapsing bookended elements (or elements within a certain distance)

bedtools intersect: identifying where two experiments overlap (or don't overlap)

A little bit of filtering, using awk