You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 22 Next »

The BED format

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

Memorize the 6 main BED fields

These 6 BED fields are so important that you should memorize them. Keep repeating "chrom, start, end, name, score, strand" until the words trip off your tongue (smile)

  1. chrom (required) – string naming the chromosome or othre contig
  2. start (required) – the 0-based start position of the region
  3. end (required) – the 1-based end position of the region
  4. name (optional) – an arbitrary string describing the region
    • for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
  5. score (optional) – an integer score for the region
    • for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
    • for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
  6. strand (optional) - a single character describing the region's strand
    • + – plus (Watson) strand region
    • - – minus (Crick) strand region
    • . – region is not associated with a strand (e.g. a transcription factor binding region)

Important rules for BED format:

  • The number of fields per line must be consistent throughout any single BED file
    • e.g. they must all have 3 fields or all have 6 fields
  • The first base on a contig is numbered 0
    • versus 1 for BAM file positions
    • so the a BED start of 99 is actually the 100th base on the contig
    • but end positions are 1-based
      • so a BED end of 200 is the 200th base on the contig
    • the length of a BED region is end - start
      • not end - start  + 1, as it would be if both coordinates with 0-based or both 1-based
    • this difference is the single greatest source of errors dealing with BED files!

Note that the UCSC Genome Browser also defines many BED-like data formats (e.g. bedGraph, narrowPeak, tagAlign and RNA elements formats). See supported UCSC Genome Browser data formats for more information and examples.

In addition to standard-format BED files, one can create custom BED files that have at least 3 of the standard fields, followed by any number of custom fields. For example:

  • A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns (all records with the same number)
  • A BED6+ file contains the 3 required BED fields, followed by some number of user-defined columns

As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.

BEDTools overview

The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.

Sub-commandDescriptionUse case(s)
bamtobedConvert BAM files to BED format.You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg18 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfastaGet FASTA entries corresponding to regions.Preparation to run motif analysis, which requires the original sequences, on a set of regions of interest. You must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
coverageCompute genome-wide coverage of your regsions, or generate per-base genome-wide signal trace.
  • You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.
  • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file.
  • Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicovCount overlaps between one or more BAM or BED files and a set of regions of interest.
  • Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. DESeq2 or EdgeR), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.Collapse overlapping gene annotations into per-strand non-overlapping regions before counting. If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtractRemove unwanted regions.Remove rRNA gene regions from a merged gene annotations file before counting.
intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report (not just count) the overlapping regions.
closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

We will explore a few of these functions in our exercises.

BEDTools versions

BEDTools is under active development and is always being refined and extended. Unfortunately, sometimes changes are made that are incompatible with previous BEDTools versions. For example, a major change to the way bedtool merge functions was made after bedtools v2.17.0.

So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the version on TACC is bedtools v2.25.0.

copy and paste exercise files
module load bedtools
bedtools --version

Input format considerations

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) must use Unix line endings (linefeed only).

The most important thing to remember about comparing regions using BEDTools, is that all region files must share the same set contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, annotations from the corresponding GTF annotations.

Exercises

About GTF/GFF annotation files

Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).

Unfortunately, there are a number of variations of both formats However both GTF and GFF share the first 8 fields (Tab-separated):

  1. seqname - The name of the chromosome or scaffold.
  2. source - Name of the program that generated this feature, or other data source (e.g. database)
  3. feature_type - Type of the feature. Examples of common feature types include:
    • Some examples of common feature types are:
      • CDS (coding sequence), exon
      • gene, transcript
      • start_codon, stop_codon
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric value. Often but not always an integer.
  7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
  8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

The columns will care about are (1) seqname, (3) feature_type and (4,5) start, end. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.

So where is the real annotation information, such as the unique gene ID or gene name? Well, both formats also have a 9th field, which is usually populated by a set of name/value pair attributes, including the feature identifier (e.g., the unique gene, transcript or exon ID), the feature name (e.g. a gene or transcript name), as well as other information (description and biotype being the most common and useful).

Let's take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff to understand this better, starting with the first 8 fields:

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

cat sacCer_R64-1-1_20110208.gff | cut -f 1-8 | head -50

You should see something like this:

Part of the sacCer_R64-1-1_20110208.gff annotation file
##gff-version 3
#date Tue Feb  8 19:50:12 2011
#
# Saccharomyces cerevisiae S288C genome
#
# Features from the 16 nuclear chromosomes labeled chrI to chrXVI,
# plus the mitochondrial genome labeled chrMito and the 2-micron plasmid.
#
# Created by Saccharomyces Genome Database (http://www.yeastgenome.org/)
#
# Weekly updates of this file are available via Anonymous FTP from:
# ftp://ftp.yeastgenome.org/yeast/data_download/chromosomal_feature/saccharomyces_cerevisiae.gff
#
# Please send comments and suggestions to yeast-curator@yeastgenome.org
#
# SGD is funded as a National Human Genome Research Institute Biomedical Informatics Resource from
# the U. S. National Institutes of Health to Stanford University.  The staff of SGD is listed at:
# http://www.yeastgenome.org/SGD-staff.html
#
chrI    SGD     chromosome      1       230218  .       .       .
chrI    SGD     repeat_region   1       62      .       -       .
chrI    SGD     telomere        1       801     .       -       .
chrI    SGD     repeat_region   63      336     .       -       .
chrI    SGD     gene    335     649     .       +       .
chrI    SGD     CDS     335     649     .       +       0
chrI    SGD     repeat_region   337     801     .       -       .
chrI    SGD     nucleotide_match        753     763     .       -       .
chrI    SGD     binding_site    532     544     .       -       .
chrI    SGD     gene    538     792     .       +       .
chrI    SGD     CDS     538     792     .       +       0
chrI    SGD     ARS     650     1791    .       .       .
chrI    SGD     gene    1807    2169    .       -       .
chrI    SGD     CDS     1807    2169    .       -       0
chrI    SGD     gene    2480    2707    .       +       .
chrI    SGD     CDS     2480    2707    .       +       0
chrI    SGD     gene    7235    9016    .       -       .
chrI    SGD     CDS     7235    9016    .       -       0
chrI    SGD     ARS     7997    8547    .       .       .
chrI    SGD     gene    10091   10399   .       +       .
chrI    SGD     CDS     10091   10399   .       +       0
chrI    SGD     gene    11565   11951   .       -       .
chrI    SGD     CDS     11565   11951   .       -       0
chrI    SGD     gene    12046   12426   .       +       .
chrI    SGD     CDS     12046   12426   .       +       0
chrI    SGD     gene    13363   13743   .       -       .
chrI    SGD     CDS     13363   13743   .       -       0
chrI    SGD     gene    21566   21850   .       +       .
chrI    SGD     CDS     21566   21850   .       +       0
chrI    SGD     long_terminal_repeat    22230   22552   .       +       .
chrI    SGD     gene    22395   22685   .       -       .

In addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3.You can also look at the contents with less, and see that the attributes column information is really ugly.

One of the first things you want to know about your annotation file is what gene fetures it contains. Here's how to find that:

cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | sort | uniq -c | sort -k1,1nr | more

(Read more about what's going on here at Piping a histogram.)

You should see something like this.

Part of the sacCer_R64-1-1_20110208.gff annotation file
  7077 CDS
  6607 gene
   480 noncoding_exon
   383 long_terminal_repeat
   376 intron
   337 ARS
   299 tRNA
   190 region
   129 repeat_region
   102 nucleotide_match
    89 transposable_element_gene
    77 snoRNA
    50 LTR_retrotransposon
    32 telomere
    31 binding_site
    27 rRNA
    24 five_prime_UTR_intron
    21 pseudogene
    17 chromosome
    16 centromere
    15 ncRNA
     8 external_transcribed_spacer_region
     8 internal_transcribed_spacer_region
     6 snRNA
     3 gene_cassette
     2 insertion

Next let's create a file that contains only the 6607 gene entries:

conversion with awk
cat sacCer_R64-1-1_20110208.gff | awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' > sc_genes.gff


Q: Convert saccharomyces_cerevisiae_R64-1-1_20110208.gff into a 3 column bed file that includes 'gene' feature.

The gff file is under /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff

conversion with awk
cat saccharomyces_cerevisiae_R64-1-1_20110208.gff | awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print $1,$4-1,$5} }' > sc_gene_3c.bed

Q: Convert saccharomyces_cerevisiae_R64-1-1_20110208.gff into a 4 column bed file that includes 'gene' feature (4th column has gene IDs)

conversion with awk
cat /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/saccharomyces_cerevisiae_R64-1-1_20110208.gff | \
awk 'BEGIN{FS="\t";OFS="\t"} $3=="gene" {split($9,x,";") ; $9=substr(x[1],4) ; print $1,$4-1,$5,$9}' > sc_gene.bed

BEDTools merge

Create a directory for these exercises, copy over the yeast RNA-seq files we'll need, and load the bedtools module.

Setup for BEDTools exercises
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam .
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

module load bedtools 

About GTF/GFF annotation files

BEDTools multicov

BEDTools is a great utility for comparing genomic features in BAM, BED, VCF, and GFF formats. The documentation is well written in great detail. Here, we will practice three commonly used sub-commands: multicov, merge, and intersect.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

bedtools
bedtools multicov

The multicov command takes a feature file (GFF/BED/VCF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s option.

Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if BAM file used for mapping contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.

In order to use the bedtools command on our data, do the following:

bedtools multicov -bams yeast_chip_sort.bam yeast_chip2_sort.bam -bed sc_gene.bed > gene_counts.txt

Then take a peek at the data...

head gene_counts.txt



  • No labels