Now that we have a bam file containing only the reads we want, we can do some more sophisticated analysis using bedtools.  Bedtools changes from version to version; here we are using version 2.22, which is the newest version and the one currently installed on stampede.

First, log in to stampede and make a directory called bedtools in your scratch folder.  Then copy your filtered bam file from the samtools section into this folder.

ssh user@stampede.tacc.utexas.edu
cds
mkdir bedtools
cp yeastpairedend.filtered.bam bedtools/ #adjust the source path to wherever you saved the filtered bam file
cd bedtools

Converting a bam file to a fastq file

Sometimes, especially when working with external data, we need to go from a bam file back to a fastq file.  This can be useful for re-aligning reads using a different aligner, or different settings on the original aligner.  It can also be useful for extracting the sequence of interesting regions of the genome after you have manipulated your bam file.

For this exercise, you'll be using bamtofastq.  This function takes an aligned bam file as input and outputs a fastq format file.  If you have paired end data, you can add the -fq2 option to write the R1 and R2 reads to separate fastq files (for this, the bam file should be sorted by read name so that mates are adjacent).  This type of function is especially useful if you need to analyze sequences after you've compared several bam or bed files.

bedtools bamtofastq -i input.bam -fq output.fastq
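
Before looking at a real output file, it can help to see how a fastq file is laid out.  The sketch below builds a tiny fastq file by hand (the file name toy.fastq and both reads are invented for illustration, not output from bamtofastq), counts the reads, and prints only the quality lines.

```shell
# Toy FASTQ with two invented reads (not real data). Each record is 4 lines:
# @name, sequence, a '+' separator, and a quality string (one character per base)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n' > toy.fastq

# Number of reads = number of lines / 4
n_lines=$(wc -l < toy.fastq)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# Print only the quality strings (every 4th line)
awk 'NR % 4 == 0' toy.fastq
```

The same two commands should work on output.fastq once you have converted your own bam file.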

Exercise 1: convert bam to fastq and look at the quality scores

solution code
module load bedtools
bedtools bamtofastq -i input.bam -fq output.fastq
more output.fastq

Bed file format: converting a bam file into a bed file.

While it's useful to be able to look at the fastq file, many analyses will be easiest to perform in bed format.  Bed format is a simple tab delimited format that designates various properties about segments of the genome, defined by the chromosome, start coordinates and end coordinates.  Bedtools provides a simple utility to convert bam files over into bed files, termed bamtobed.

bedtools bamtobed -i input.bam > output.bed

Note that the output will be written to standard out unless you pipe it to a program (head, more, less) or redirect it to a file (output.bed).
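
To make the bed columns concrete, here is a minimal sketch using a hand-made bed6 file (toy.bed and its two reads are invented; real bamtobed output has the same six columns: chrom, start, end, name, score, strand).  Because bed start coordinates are 0-based and end coordinates are exclusive, the length of an interval is simply end minus start.

```shell
# Toy BED6 file (invented intervals); columns: chrom, start, end, name, score, strand
printf 'chrI\t100\t150\treadA\t30\t+\n'  > toy.bed
printf 'chrI\t120\t170\treadB\t40\t-\n' >> toy.bed

# BED starts are 0-based and ends are exclusive, so length = end - start
awk 'BEGIN { FS = OFS = "\t" } { print $4, $3 - $2 }' toy.bed > toy.lengths.txt
cat toy.lengths.txt
```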

Exercise 2: Convert the filtered yeast paired end bam to bed using bamtobed, look at your file in more, and find the number of lines in the file

solution code
module load bedtools
bedtools bamtobed -i input.bam > output.bed

more output.bed #to examine the bed file visually
wc -l output.bed #to get the number of lines in a file

press q (or ctrl+c) to quit more

Bedtools Coverage: how much of the genome does my data cover?

One way of characterizing data is to understand what percentage of the genome your data covers.  The type of experiment you performed will affect the coverage of your data.  A ChIP-seq experiment will cover binding sites, and an RNA-seq experiment will cover expressed transcripts.  Bedtools coverage allows you to compare one bed file to another and compute the breadth and depth of coverage.

bedtools coverage -a experiment.bed -b reference_file.bed

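The resulting output contains several additional columns that summarize this information.  (This describes version 2.22; from version 2.24 onward the roles of -a and -b are swapped, and coverage is reported for each interval in A.)

After each interval in B, coverageBed will report: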

  1. The number of features in A that overlapped (by at least one base pair) the B interval.
  2. The number of bases in B that had non-zero coverage from features in A.
  3. The length of the entry in B.
  4. The fraction of bases in B that had non-zero coverage from features in A.
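
The four columns above are reported per interval, so a genome-wide figure has to be computed from them.  The following is a minimal sketch of that calculation, assuming the version 2.22 column layout described above; toy.coverage.bed and every number in it are invented stand-ins for real coverage output.

```shell
# Toy stand-in for bedtools coverage output (invented numbers): the B interval
# (chrom, start, end) followed by the four columns listed above
printf 'chrI\t0\t1000\t10\t250\t1000\t0.2500000\n'    > toy.coverage.bed
printf 'chrII\t0\t3000\t20\t1750\t3000\t0.5833333\n' >> toy.coverage.bed

# Covered bases are the third-from-last column and interval length is the
# second-from-last, so counting from the end works for any number of B columns
genome_fraction=$(awk -F'\t' '{ covered += $(NF-2); total += $(NF-1) }
    END { if (total > 0) printf "%.4f", covered / total }' toy.coverage.bed)
echo "fraction of genome covered: $genome_fraction"
```

Summing covered bases and lengths first, rather than averaging the per-chromosome fractions, keeps long chromosomes from being weighted the same as short ones.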

For this exercise, we'll use a bed file that summarizes the S. cerevisiae genome, version 3 (aka sacCer3).  For this class, I've made a bed file out of the genome, using the file sacCer3.chrom.sizes.  First go and copy the file over from my scratch directory:

cd bedtools
cp /scratch/01786/awh394/core_ngs/day4_2015/sacCer3.chrom.sizes.bed .

Now use bedtools coverage to find the coverage of the file output.bed over the sacCer3 genome and examine the output coverage.

Exercise 3: Find the coverage of your bed file over the sacCer3 genome

solution code
module load bedtools
bedtools coverage -a output.bed -b sacCer3.chrom.sizes.bed > sacCer3coverage.bed
more sacCer3coverage.bed #this file should have 17 lines, one for each chromosome

Bedtools merge: collapsing bookended elements (or elements within a certain distance)

When we originally examined the bed file produced from our bam file, we saw many reads that overlap the same interval.  While this level of detail is useful, for some analyses we can collapse overlapping reads into a single line and indicate how many reads occurred over that genomic interval.  We can accomplish this using bedtools merge.

bedtools merge [OPTIONS] -i experiment.bed > experiment.merge.bed

Bedtools merge also writes its output to standard out, so make sure to direct the output to a file or a program.  While we haven't discussed the options for each bedtools function in detail, here they are very important.  The -c option lists which input columns to operate on, and the -o option lists the operation to apply to each of those columns; together they define what is computed and in what order the columns are output.  Standard bed6 format is chrom, start, stop, name, score, strand, and controlling the column operations lets you decide what goes into each column of the output.  The valid operations for the -o option are as follows:

 

 

  • sum, min, max, absmin, absmax,
  • mean, median,
  • collapse (i.e., print a delimited list (duplicates allowed)),
  • distinct (i.e., print a delimited list (NO duplicates allowed)),
  • count
  • count_distinct (i.e., a count of the unique values in the column)

For this exercise, we'll sum the score column over each merged region to get a score, use distinct to choose a name, and use distinct again to keep track of the strand.  For the -c option, list the columns to operate on in the order you want them in the output.  In this case, to keep the standard bed format, we'll use -c 4,5,6 and -o distinct,sum,distinct, which keeps the proper order of name, score, strand.
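
To see what a merge with column operations computes, here is a rough awk sketch on a hand-made file.  It is an illustration only, not how bedtools is implemented; toy.reads.bed and its reads are invented.  It mimics the count and sum operations: overlapping or bookended reads are collapsed into one interval, the reads are counted, and their scores are summed.  Note that bedtools merge expects its input sorted by chromosome and then start position, which is what the sort command does here.

```shell
# Toy BED6 input (invented reads, deliberately out of order)
printf 'chrI\t120\t170\treadB\t40\t+\n'  > toy.reads.bed
printf 'chrI\t100\t150\treadA\t30\t+\n' >> toy.reads.bed
printf 'chrI\t170\t220\treadC\t10\t+\n' >> toy.reads.bed
printf 'chrI\t500\t550\treadD\t20\t-\n' >> toy.reads.bed
printf 'chrII\t10\t60\treadE\t50\t+\n'  >> toy.reads.bed

# With bedtools loaded, the comparable call on the sorted file would be:
#   bedtools merge -c 4,5 -o count,sum -i toy.sorted.bed
sort -k1,1 -k2,2n toy.reads.bed |
awk 'BEGIN { OFS = "\t" }
     # Start a new merged interval when the chromosome changes or there is a gap
     NR == 1 || $1 != cur_chrom || $2 + 0 > cur_end + 0 {
         if (NR > 1) print cur_chrom, cur_start, cur_end, n, sum
         cur_chrom = $1; cur_start = $2; cur_end = $3; n = 0; sum = 0
     }
     {
         if ($3 + 0 > cur_end + 0) cur_end = $3   # extend the merged interval
         n++; sum += $5                           # count reads, sum scores
     }
     END { if (NR > 0) print cur_chrom, cur_start, cur_end, n, sum }' > toy.merged.bed
cat toy.merged.bed
```

The first three reads collapse into a single interval because readC starts exactly where readB ends (bookended), while readD and readE each stay on their own line.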

Exercise 4: Use bedtools merge to merge an experiment, look at the output and see how many lines there are in the file

Hint: make sure there is no whitespace inside the comma-separated lists for the -c and -o options!

solution code
bedtools merge -c 4,5,6 -o distinct,sum,distinct -i output.bed > output.merge.bed
more output.merge.bed
wc -l output.merge.bed

Bedtools intersect: identifying where two experiments overlap

One useful way to compare two experiments (especially biological replicates, or similar experiments in two yeast strains/cell lines/mouse strains) is to compare where reads in one experiment overlap with reads in another experiment.  Bedtools offers a simple way to do this using the intersect function.

 

bedtools intersect [OPTIONS] -a <FILE> \
                             -b <FILE1, FILE2, ..., FILEN>

The intersect function has many options that control how to report the intersection.  We'll be focusing on just a few of these options, listed below.

-a and -b indicate which files to intersect.  In -b, you can specify one or several files to intersect with the file specified in -a.

-wa:   Write the original entry in A for each overlap.

-wb:   Write the original entry in B for each overlap. Useful for knowing what A overlaps. Restricted by -f and -r.

-loj:   Perform a “left outer join”. That is, for each feature in A report each overlap with B. If no overlaps are found, report a NULL feature for B.

-wo:   Write the original A and B entries plus the number of base pairs of overlap between the two features. Only A features with overlap are reported. Restricted by -f and -r.

-wao: Write the original A and B entries plus the number of base pairs of overlap between the two features. However, A features w/o overlap are also reported with a NULL B feature and overlap = 0. Restricted by -f and -r.
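
As an illustration of what -wo reports, the sketch below works through two tiny hand-made files in awk.  The file names toy.a.bed and toy.b.bed and all of their features are invented, and the awk loop is only a stand-in for the comparison, not the bedtools implementation.  For each pair of features on the same chromosome, the overlap is the earlier end minus the later start, and only pairs with a positive overlap are printed.

```shell
# Two toy experiments (invented features)
printf 'chrI\t100\t200\ta1\nchrI\t300\t400\ta2\n' > toy.a.bed
printf 'chrI\t150\t250\tb1\nchrI\t900\t950\tb2\n' > toy.b.bed

# With bedtools loaded, the comparable call would be:
#   bedtools intersect -wo -a toy.a.bed -b toy.b.bed
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { b[++nb] = $0; next }        # first file read: store B features
     {
         for (i = 1; i <= nb; i++) {
             split(b[i], f, "\t")
             if (f[1] != $1) continue         # different chromosome
             lo = ($2 + 0 > f[2] + 0) ? $2 + 0 : f[2] + 0   # later start
             hi = ($3 + 0 < f[3] + 0) ? $3 + 0 : f[3] + 0   # earlier end
             if (hi > lo) print $0, b[i], hi - lo
         }
     }' toy.b.bed toy.a.bed > toy.overlap.txt
cat toy.overlap.txt
```

Only a1 and b1 overlap (by 50 base pairs), so a single line is written.  With -wao, a2 would also be reported, paired with a NULL B feature and an overlap of 0.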

 

Exercise 5: Intersect two experiments using intersect

solution code
solution goes here

 

 
