You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 35 Next »

The BED format

BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)

Memorize the 6 main BED fields

These 6 BED fields are so important that you should memorize them. Keep repeating "chrom, start, end, name, score, strand" until the words trip off your tongue (smile)

  1. chrom (required) – string naming the chromosome or othre contig
  2. start (required) – the 0-based start position of the region
  3. end (required) – the 1-based end position of the region
  4. name (optional) – an arbitrary string describing the region
    • for BED files loaded as UCSC Genome Browser tracks, this text is displayed above the region
  5. score (optional) – an integer score for the region
    • for BED files to be loaded as UCSC Genome Browser tracks, this should be a number between 0 and 1000, higher = "better"
    • for non-GenBrowse BED files, this can be any integer value (e.g. the length of the region)
  6. strand (optional) - a single character describing the region's strand
    • + – plus (Watson) strand region
    • - – minus (Crick) strand region
    • . – region is not associated with a strand (e.g. a transcription factor binding region)

Important rules for BED format:

  • The number of fields per line must be consistent throughout any single BED file
    • e.g. they must all have 3 fields or all have 6 fields
  • The first base on a contig is numbered 0
    • versus 1 for BAM file positions
    • so the a BED start of 99 is actually the 100th base on the contig
    • but end positions are 1-based
      • so a BED end of 200 is the 200th base on the contig
    • the length of a BED region is end - start
      • not end - start  + 1, as it would be if both coordinates with 0-based or both 1-based
    • this difference is the single greatest source of errors dealing with BED files!

Note that the UCSC Genome Browser also defines many BED-like data formats (e.g. bedGraph, narrowPeak, tagAlign and RNA elements formats). See supported UCSC Genome Browser data formats for more information and examples.

In addition to standard-format BED files, one can create custom BED files that have at least 3 of the standard fields, followed by any number of custom fields. For example:

  • A BED3+ file contains the 3 required BED fields, followed by some number of user-defined columns (all records with the same number)
  • A BED6+ file contains the 3 required BED fields, followed by some number of user-defined columns

As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.

BEDTools overview

The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.

Sub-commandDescriptionUse case(s)
bamtobedConvert BAM files to BED format.You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg18 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfastaGet FASTA entries corresponding to regions.Preparation to run motif analysis, which requires the original sequences, on a set of regions of interest. You must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
coverageCompute genome-wide coverage of your regsions; generate per-base genome-wide signal trace.
  • You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.
  • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file.
  • Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicovCount overlaps between one or more BAM files and a set of regions of interest.
  • Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. DESeq2 or EdgeR), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.Collapse overlapping gene annotations into per-strand non-overlapping regions before counting. If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtractRemove unwanted regions.Remove rRNA gene regions from a merged gene annotations file before counting.
intersectDetermine the overlap between two sets of regions.Similar to multicov, but takes BED files as input and can also report (not just count) the overlapping regions.
closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

We will explore a few of these functions in our exercises.

BEDTools versions

BEDTools is under active development and is always being refined and extended. Unfortunately, sometimes changes are made that are incompatible with previous BEDTools versions. For example, a major change to the way bedtool merge functions was made after bedtools v2.17.0.

So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the version on TACC is bedtools v2.25.0. (Login to login5.ls5.tacc.utexas.edu first.)

copy and paste exercise files
module load bedtools
bedtools --version

Input format considerations

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) must use Unix line endings (linefeed only).

The most important thing to remember about comparing regions using BEDTools, is that all region files must share the same set contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, annotations from the corresponding GFF/GTF annotations.

About GFF/GTF annotation files

Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).

Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GFF/GTF format, you will not like the results. This is because the most useful information in a GFF/GTF file is in a loosly-structured attributes field.

Also unfortunately, there are a number of variations of both annotation formats However both GTF and GFF share the first 8 fields (Tab-separated):

  1. seqname - The name of the chromosome or scaffold.
  2. source - Name of the program that generated this feature, or other data source (e.g. database)
  3. feature_type - Type of the feature. Examples of common feature types include:
    • Some examples of common feature types are:
      • CDS (coding sequence), exon
      • gene, transcript
      • start_codon, stop_codon
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric value. Often but not always an integer.
  7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
  8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

The Tab-separated columns will care about are (1) seqname, (3) feature_type and (4,5) start, end. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.

So where is the real annotation information, such as the unique gene ID or gene name? Both formats also have a 9th field, which is usually populated by a set of name/value pair attributes, and that's where the useful information is (e.g. the unique feature identifier, name, and so forth).

Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less. (Login to login5.ls5.tacc.utexas.edu first.)

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff .

less sacCer_R64-1-1_20110208.gff

In addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3. You see also see tags like Name=YAL067C;gene=SEO1; among the attributes on some records, but in general the attributes column information is really ugly.

To summarize, we have two problems to solve:

  1. We only care about a subset of feature types (here genes), and
  2. We want the important annotation information – gene names and IDs – to appear as regular columns instead of weird name/value pairs.

Filter annotations based on desired feature type

One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)

cd $SCRATCH/core_ngs/bedtools
cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | sort | uniq -c | sort -k1,1nr | more

You should see something like this.

Part of the sacCer_R64-1-1_20110208.gff annotation file
  7077 CDS
  6607 gene
   480 noncoding_exon
   383 long_terminal_repeat
   376 intron
   337 ARS
   299 tRNA
   190 region
   129 repeat_region
   102 nucleotide_match
    89 transposable_element_gene
    77 snoRNA
    50 LTR_retrotransposon
    32 telomere
    31 binding_site
    27 rRNA
    24 five_prime_UTR_intron
    21 pseudogene
    17 chromosome
    16 centromere
    15 ncRNA
     8 external_transcribed_spacer_region
     8 internal_transcribed_spacer_region
     6 snRNA
     3 gene_cassette
     2 insertion

Let's create a file that contains only the 6607 gene entries:

Filter GFF gene feature with awk
cat sacCer_R64-1-1_20110208.gff | awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' > sc_genes.gff
wc -l sc_genes.gff 

The line count of sc_genes.gff should be 6607 – one for each gene entry.

Convert GFF/GTF format to BED with feature ID in the name field

Our sc_genes.gff annotation subset now contains only the 6607 genes in the Saccharomyces cerevisiae genome. This addresses our first problem, but entries in this file still have the important information – the gene ID and name – in the loosely-structured 9th attributes field.

If we want to associate reads with features, we need to have the feature names where they are easy to extract!

What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.

Let Anna know if you run into problems

If this script doesn't work on your annotation file, please let Anna know. She is always looking for cases where the conversion fails, and will try to fix it.

Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).

Convert GFF to BED with BioITeam script
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 > sc_genes.converted.bed

The program reads the input file twice – once to gather all the attribute names, and then a second time to write the attribute values in well-defined columns. You'll see output like this:

Convert GFF to BED with BioITeam script
----------------------------------------
Gathering all attribute names for GTF 'sc_genes.gff'...
  urlDecode = 1, tagAttr = tag
Done!
  6607 lines read
  6607 locus entries
  8 attributes found:
(Alias ID Name Note Ontology_term dbxref gene orf_classification)
----------------------------------------
Writing BED output for GTF 'sc_genes.gff'...
Done! Wrote 6607 locus entries from 6607 lines

To find out what the resulting columns are, look at the header line out the output BED file:

Convert GFF to BED with BioITeam script
head -1 sc_genes.converted.bed 

For me the resulting 16 attributes are as follows (they may have a different order for you). I've numbered them for convenience

Converted BED attributes
 1. chrom          2. start   3. end     4. featureType  5. length  6. strand
 7. source         8. frame   9. Alias  10. ID          11. Name   12. Note
13. Ontology_term 14. dbxref 15. gene   16. orf_classification

The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.

Re-order the final BED fields
head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr
tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = $10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output.

Finally, the 8 re-ordered attributes are:

Converted BED attributes
 1. chrom  2. start               3. end             4. ID      5. length  6. strand
 7. gene   8. orf_classification

**Whew**! That was a lot of work. Welcome to the world of annotation wrangling. It's never pretty! But at least the result is much nicer looking. Examine the results using more or less or head:

Examine our BED-format annotations
cat sc_genes.bed | head -20

Doesn't this look better?

Converted BED attributes
chrI    334     649     YAL069W 315     +       YAL069W Dubious
chrI    537     792     YAL068W-A       255     +       YAL068W-A       Dubious
chrI    1806    2169    YAL068C 363     -       PAU8    Verified
chrI    2479    2707    YAL067W-A       228     +       YAL067W-A       Uncharacterized
chrI    7234    9016    YAL067C 1782    -       SEO1    Verified
chrI    10090   10399   YAL066W 309     +       YAL066W Dubious
chrI    11564   11951   YAL065C 387     -       YAL065C Uncharacterized
chrI    12045   12426   YAL064W-B       381     +       YAL064W-B       Uncharacterized
chrI    13362   13743   YAL064C-A       381     -       YAL064C-A       Uncharacterized
chrI    21565   21850   YAL064W 285     +       YAL064W Verified
chrI    22394   22685   YAL063C-A       291     -       YAL063C-A       Uncharacterized
chrI    23999   27968   YAL063C 3969    -       FLO9    Verified
chrI    31566   32940   YAL062W 1374    +       GDH3    Verified
chrI    33447   34701   YAL061W 1254    +       BDH2    Uncharacterized
chrI    35154   36303   YAL060W 1149    +       BDH1    Verified
chrI    36495   36918   YAL059C-A       423     -       YAL059C-A       Dubious
chrI    36508   37147   YAL059W 639     +       ECM1    Verified
chrI    37463   38972   YAL058W 1509    +       CNE1    Verified
chrI    38695   39046   YAL056C-A       351     -       YAL056C-A       Dubious
chrI    39258   41901   YAL056W 2643    +       GPB2    Verified


Note that value in the 8th column. In the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious. The Dubious ones have no experimental evidence so are generally excluded.

Exercise: How many genes in our sc_genes.bed file are in each category?

Use cut to isolate that field, sort to sort the resulting values into blocks, then uniq -c to count the members of each block.

cut -f 8 sc_genes.bed | sort | uniq -c

You should see this:

    810 Dubious
    897 Uncharacterized
   4896 Verified
      4 Verified|silenced_gene

If you want to further order this output listing the most abundant category first, add another sort statement:

cut -f 8 sc_genes.bed | sort | uniq -c | sort -k 1,1nr

The -k 1,1nr options says to sort on the 1st field of input, using numeric sorting, in reverse order (i.e., largest first). Which produces:

   4896 Verified
    897 Uncharacterized
    809 Dubious
      4 Verified|silenced_gene

About strandedness

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

  • -s says the R1 read is sense stranded (on the same strand as the gene).
  • -S says the R1 read is antisense stranded (the opposite strand as the gene).

RNA-seq libraries can be constructed with 3 types of strandedness:

  1. sense stranded – the R1 read should be on the same strand as the gene.
  2. antisense stranded – the R1 read should be on the opposite strand as the gene.
  3. unstranded – the R1 could be on either strand

Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.

If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.

Exercises

We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.

Get the RNA-seq BAM

First start an idev session, since we will be doing some significant computation.

Start an idev session
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

Setup for BEDTools exercises
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
cp $CORENGS/yeast_rna/sc_genes.bed* . 

Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?

samtools flagstat for the different read counts.

samtools view + cut + sort + uniq -c for mapping quality distribution

samtools view + awk for average mapping quality

cd $SCRATCH/core_ngs/bedtools
module load samtools
samtools flagstat yeast_mrna.sort.filt.bam | tee yeast_mrna.flagstat.txt
samtools flagstat output
3323242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
922114 + 0 duplicates
3323242 + 0 mapped (100.00% : N/A)
3323242 + 0 paired in sequencing
1661699 + 0 read1
1661543 + 0 read2
3323242 + 0 properly paired (100.00% : N/A)
3323242 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%.

To get the distribution of mapping qualities:

samtools view yeast_mrna.sort.filt.bam | cut -f 5 | sort | uniq -c | sort -k2,2n
distribution of mapping qualities
    453 20
   6260 21
    889 22
    326 23
    971 24
   2698 25
    376 26
  12769 27
    268 28
    337 29
    933 30
   1229 31
    345 32
   5977 33
    256 34
    249 35
   1103 36
    887 37
    292 38
   4648 39
   5706 40
    426 41
   1946 42
   1547 43
   1761 44
   6138 45
   1751 46
   3019 47
   3710 48
   3236 49
   4467 50
  15691 51
  25370 52
  16636 53
  18081 54
   7084 55
   2701 56
  59851 57
   2836 58
   2118 59
3097901 60

To compute average mapping quality:

samtools view yeast_mrna.sort.filt.bam | awk '
  BEGIN{FS="\t"; sum=0; tot=0}
  {sum = sum + $5; tot = tot + 1}
  END{print "average:",sum/tot,"for",tot,"reads"}'

Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records.

Use bedtools multicov to count feature overlaps

In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html) takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.

Here's how to run bedtools multicov, directing the standard output to a file:

module load bedtools
bedtools multicov -s -bams yeast_mrna.sort.filt.bam -bed sc_genes.bed > yeast_mrna_gene_counts.bed

Exercise: How may records of output were written? Where is the count of overlaps per output record?

wc -l yeast_mrna_gene_counts.bed

6607 records were written, one for each feature in the sc_genes.bed file.

The overlap count was added as the last field in each output record (here field 9, since the input annotation file had 8 columns).

Exercise: How many features have non-zero overlap counts?

cut -f 9 yeast_mrna_gene_counts.bed | grep -v '^0' | wc -l

Most of the genes (6235/6607) have non-zero read overlap counts.

Exercise: What is the total count of reads mapping to gene features?

cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records"}'

There are 1144990 overlapping reads in 6235 records

Recall that in the yeast annotations from SGD there are 3 gene classifications: Verified, Uncharacterized and Dubious, and the Dubious ones have no experimental evidence.

Exercise: What is the total count of reads mapping to gene features other than Dubious?

grep -v 'Dubious'
grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"overlapping reads in",tot,"records"}'

There are 1093140 overlapping reads in 5578 records

BEDTools merge






  • No labels