Analysis using BEDTools

The BED format

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

Memorize the 6 main BED fields

These 6 BED fields are so important that you should memorize them. Keep repeating "chrom, start, end, name, score, strand" until the words trip off your tongue

contig (required) – string naming the chromosome or othre contig
start (required) – the 0-based start position of the region
end (required) – the 1-based end position of the region
name (optional) – an arbitrary string describing the region
- for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
score (optional) – an integer score for the region
- for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
- for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
strand (optional) - a single character describing the region's strand
- + – plus (Watson) strand region
- - – minus (Crick) strand region
- . – region is not associated with a strand (e.g. a transcription factor binding region)

Important rules for BED format:

The number of fields per line must be consistent throughout any single BED file
- e.g. they must all have 3 fields or all have 6 fields
The first base on a contig is numbered 0
- versus 1 for BAM file positions
- so the a BED start of 99 is actually the 100th base on the contig
- but end positions are 1-based
  - so a BED end of 200 is the 200th base on the contig
- the length of a BED region is end - start
  - not end - start + 1, as it would be if both coordinates with 0-based or both 1-based
- this difference is the single greatest source of errors dealing with BED files!

BED files that have at least 3 fields See supported data formats for custom tracks for more information and examples.

BEDTools overview

Exercises

Setup

BED format practice 1

copy and paste exercise files

cds
mkdir bedtools_practice
cd bedtools_practice
cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/saccharomyces_cerevisiae_R64-1-1_20110208.gff .
cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/*sort.bam .
cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/*.bai .
cp /corral-repl/utexas/BioITeam/core_ngs_tools/human_stuff/bedtools_sample/*bed .

module swap intel gcc/4.7.1
module load bedtools

Q: Convert saccharomyces_cerevisiae_R64-1-1_20110208.gff into a 3 column bed file that includes 'gene' feature.

The gff file is under /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff

Answer

conversion with awk

cat saccharomyces_cerevisiae_R64-1-1_20110208.gff | awk 'BEGIN{FS="\t";OFS="\t"} $3=="gene" {print $1,$4-1,$5}' > sc_gene_3c.bed

BED format practice 2

Q: Convert saccharomyces_cerevisiae_R64-1-1_20110208.gff into a 4 column bed file that includes 'gene' feature (4th column has gene IDs)

Answer

conversion with awk

cat /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/saccharomyces_cerevisiae_R64-1-1_20110208.gff | \
awk 'BEGIN{FS="\t";OFS="\t"} $3=="gene" {split($9,x,";") ; $9=substr(x[1],4) ; print $1,$4-1,$5,$9}' > sc_gene.bed

BEDTools multicov

BEDTools is a great utility for comparing genomic features in BAM, BED, VCF, and GFF formats. The documentation is well written in great detail. Here, we will practice three commonly used sub-commands: multicov, merge, and intersect.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

bedtools
bedtools multicov

The multicov command takes a feature file (GFF/BED/VCF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s option.

Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if BAM file used for mapping contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.

In order to use the bedtools command on our data, do the following:

bedtools multicov -bams yeast_chip_sort.bam yeast_chip2_sort.bam -bed sc_gene.bed > gene_counts.txt

Then take a peek at the data...

head gene_counts.txt

BEDTools merge

Output of analysis could have overlapped genomic regions. Histone marks ChIP-seq is a good example. Peak calling tools require an estimated peak width for accurate peak identification and input correction. In contrast to sequence specific transcription factors, occupancies of histone marks are broad, variable, and overlapped. These overlapped regions need to be merged.

There are two ChIP-seq peak files: ChIP1_peak.bed and ChIP2_peak.bed.

Count how many peaks in each file:

wc -l *bed

Merge the peaks then count the number of peaks:

bedtools merge -i ChIP1_peak.bed | wc -l
bedtools merge -i ChIP2_peak.bed | wc -l

Q: Extend each peak by 500 bp upstream and 500 bp downstream, then merge and count the number of peaks

Answer

conversion with awk

cat ChIP1_peak.bed | awk 'BEGIN{OFS="\t"} {print $1,$2-500,$3+500}' | bedtools merge -i - | wc -l

If you want to save the merged bed files above, you can redirect the standard outputs instead of wc -l.

BEDTools intersect

Peak overlap analysis between two peak files is very easy using BEDTools intersect.

bedtools intersect -a ChIP1_peak.bed -b ChIP2_peak.bed | wc -l

In real analysis, what you want to do would be more complicated than simple 'intersect' example above. For instance, you may want to i) merge within a file ii) extend peak region iii) intersect. Here is one liner.

bedtools intersect \
-a <(cat ChIP1_peak.bed | awk 'BEGIN{OFS="\t"} {print $1,$2-500,$3+500}' | bedtools merge -i -) \
-b <(cat ChIP2_peak.bed | awk 'BEGIN{OFS="\t"} {print $1,$2-500,$3+500}' | bedtools merge -i -) | wc -l

As you could notice, the number of peaks in each file is a lot abundant because we haven't applied any cut-off threshold.

Q: Write one liner for i) merge within a file ii) sort by peak score iii) take top 1000 peaks vi) re-sort to make bed format v) extend peak region iv) intersect

Answer

conversion with awk

bedtools intersect \
-a <(cat ChIP1_peak.bed | sort -k5 -nr | head -1000 | sort -k1,1 -k2,2n | awk 'BEGIN{OFS="\t"} {print $1,$2-500,$3+500}' | bedtools merge -i -) \
-b <(cat ChIP2_peak.bed | sort -k5 -nr | head -1000 | sort -k1,1 -k2,2n | awk 'BEGIN{OFS="\t"} {print $1,$2-500,$3+500}' | bedtools merge -i -) | wc -l