BED (Browser Extensible Data) format is a simple text format for location-oriented data (genomic regions) developed to support UCSC Genome Browser tracks. Standard BED files have 3 to 6 Tab-separated columns, although up to 12 columns are defined. (Read more about the UCSC Genome Browser's official BED format.)
These 6 BED fields are so important that you should memorize them. Keep repeating "chrom, start, end, name, score, strand" until the words trip off your tongue |
Important rules for BED format:
Note that the UCSC Genome Browser also defines many BED-like data formats (e.g. bedGraph, narrowPeak, tagAlign and RNA elements formats). See supported UCSC Genome Browser data formats for more information and examples.
In addition to standard-format BED files, one can create custom BED files that have at least 3 of the standard fields, followed by any number of custom fields. For example:
As we will see, BEDTools functions require BED3+ input files, or BED6+ if strand-specific operations are requested.
The BEDTools suite is a set of utilities for manipulating BED and BAM files. We call it the "Swiss army knife" for genomic region analyses because its sub-commands are so numerous and versatile. Some of the most common bedtools operations perform set-theory functions on regions: intersection (intersect), union (merge), set difference (subtract) – but there are many others. The table below lists some of the most useful sub-commands along with applicable use cases.
Sub-command | Description | Use case(s) |
---|---|---|
bamtobed | Convert BAM files to BED format. | You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string. |
bamtofastq | Extract FASTQ sequences from BAM alignment records. | You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg18 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences. |
getfasta | Get FASTA entries corresponding to regions. | Preparation to run motif analysis, which requires the original sequences, on a set of regions of interest. You must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index). |
coverage | Compute genome-wide coverage of your regsions; generate per-base genome-wide signal trace. |
|
multicov | Count overlaps between one or more BAM files and a set of regions of interest. |
|
merge | Combine a set of possibly-overlapping regions into a single set of non-overlapping regions. | Collapse overlapping gene annotations into per-strand non-overlapping regions before counting. If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects. |
subtract | Remove unwanted regions. | Remove rRNA gene regions from a merged gene annotations file before counting. |
intersect | Determine the overlap between two sets of regions. | Similar to multicov, but takes BED files as input and can also report (not just count) the overlapping regions. |
closest | Find the genomic features nearest to a set of regions. | For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation. |
We will explore a few of these functions in our exercises.
BEDTools is under active development and is always being refined and extended. Unfortunately, sometimes changes are made that are incompatible with previous BEDTools versions. For example, a major change to the way bedtool merge functions was made after bedtools v2.17.0.
So it is important to know which version of BEDTools you are using, and read the documentation carefully to see if changes have been made since your version. When you run the code below, you should see that the version on TACC is bedtools v2.25.0. (Login to login5.ls5.tacc.utexas.edu first.)
module load bedtools bedtools --version |
The most important thing to remember about comparing regions using BEDTools, is that all region files must share the same set contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, annotations from the corresponding GFF/GTF annotations.
Annotation files that you retrieve from public databases are often in GTF (Gene Transfer Format) or one of the in GFF (General Feature Format) formats (usually GFF3 these days).
Unfortunately, both formats are obscure and hard to work with directly. While bedtools does accept annotation files in GFF/GTF format, you will not like the results. This is because the most useful information in a GFF/GTF file is in a loosly-structured attributes field.
Also unfortunately, there are a number of variations of both annotation formats However both GTF and GFF share the first 8 fields (Tab-separated):
The Tab-separated columns will care about are (1) seqname, (3) feature_type and (4,5) start, end. The reason we care is that when working with annotations, we usually only want to look at annotations of a particular type, most commonly gene, but also transcript or exon.
So where is the real annotation information, such as the unique gene ID or gene name? Both formats also have a 9th field, which is usually populated by a set of name/value pair attributes, and that's where the useful information is (e.g. the unique feature identifier, name, and so forth).
Take a quick look at a yeast annotation file, sacCer_R64-1-1_20110208.gff using less. (Login to login5.ls5.tacc.utexas.edu first.)
mkdir -p $SCRATCH/core_ngs/bedtools cd $SCRATCH/core_ngs/bedtools cp $CORENGS/yeast_rna/sacCer_R64-1-1_20110208.gff . less sacCer_R64-1-1_20110208.gff |
In addition to comment lines (starting with #), you can see the chrI contig names in column 1 and various feature types in column 3. You see also see tags like Name=YAL067C;gene=SEO1; among the attributes on some records, but in general the attributes column information is really ugly.
To summarize, we have two problems to solve:
One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping a histogram.)
cd $SCRATCH/core_ngs/bedtools cat sacCer_R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | sort | uniq -c | sort -k1,1nr | more |
You should see something like this.
7077 CDS 6607 gene 480 noncoding_exon 383 long_terminal_repeat 376 intron 337 ARS 299 tRNA 190 region 129 repeat_region 102 nucleotide_match 89 transposable_element_gene 77 snoRNA 50 LTR_retrotransposon 32 telomere 31 binding_site 27 rRNA 24 five_prime_UTR_intron 21 pseudogene 17 chromosome 16 centromere 15 ncRNA 8 external_transcribed_spacer_region 8 internal_transcribed_spacer_region 6 snRNA 3 gene_cassette 2 insertion |
Let's create a file that contains only the 6607 gene entries:
cat sacCer_R64-1-1_20110208.gff | awk 'BEGIN{FS=OFS="\t"}{ if($3=="gene"){print} }' > sc_genes.gff wc -l sc_genes.gff |
The line count of sc_genes.gff should be 6607 – one for each gene entry.
Our sc_genes.gff annotation subset now contains only the 6607 genes in the Saccharomyces cerevisiae genome. This addresses our first problem, but entries in this file still have the important information – the gene ID and name – in the loosely-structured 9th attributes field.
If we want to associate reads with features, we need to have the feature names where they are easy to extract!
What most folks to is find some way to convert their GFF/GTF file to a BED file, parsing out some (or all) of the name/value attribute pairs into BED file columns after the standard 6. You can find such conversion programs on the web – or write one yourself. Or you could use the BioITeam conversion script, /work/projects/BioITeam/common/script/gtf_to_bed.pl. While it will not work 100% of the time, it manages to do a decent job on most GFF/GTF files. And it's pretty easy to run.
If this script doesn't work on your annotation file, please let Anna know. She is always looking for cases where the conversion fails, and will try to fix it. |
Here we just give the script the GFF file to convert, plus a 1 that tells it to URL decode weird looking text (e.g. our Note attribute values).
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 > sc_genes.converted.bed |
The program reads the input file twice – once to gather all the attribute names, and then a second time to write the attribute values in well-defined columns. You'll see output like this:
---------------------------------------- Gathering all attribute names for GTF 'sc_genes.gff'... urlDecode = 1, tagAttr = tag Done! 6607 lines read 6607 locus entries 8 attributes found: (Alias ID Name Note Ontology_term dbxref gene orf_classification) ---------------------------------------- Writing BED output for GTF 'sc_genes.gff'... Done! Wrote 6607 locus entries from 6607 lines |
To find out what the resulting columns are, look at the header line out the output BED file:
head -1 sc_genes.converted.bed |
For me the resulting 16 attributes are as follows (they may have a different order for you). I've numbered them for convenience
1. chrom 2. start 3. end 4. featureType 5. length 6. strand 7. source 8. frame 9. Alias 10. ID 11. Name 12. Note 13. Ontology_term 14. dbxref 15. gene 16. orf_classification |
The final transformation is to do a bit of re-ordering, dropping some fields. We'll do this with awk, becuase cut can't re-order fields. While this is not strictly required, it can be helpful to have the critical fields (including the gene ID) in the 1st 6 columns. We do this separately for the header line and the rest of the file so that the BED file we give bedtools does not have a header (but we know what those fields are). We would normally preserve valuable annotation information such as Ontology_term, dbxref and Note, but drop them here for simplicity.
head -1 sc_genes.converted.bed | sed 's/\r//' | awk ' BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16} ' > sc_genes.bed.hdr tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk ' BEGIN{FS=OFS="\t"} { if($15 == "") {$15 = 10} # make sure gene name is populated print $1,$2,$3,$10,$5,$6,$15,$16} ' > sc_genes.bed |
One final detail. Annotation files you download may have non-Unix (linefeed-only) line endings. Specifically, they may use Windows line endings (carriage return + linefeed). (Read about Line ending nightmares.) The expression sed 's/\r//' uses the sed (substitution editor) tool to replace carriage return characters ( \r ) with nothing, removing them from the output.
Finally, the 8 re-ordered attributes are:
1. chrom 2. start 3. end 4. ID 5. length 6. strand 7. gene 8. orf_classification |
**Whew**! That was a lot of work. Welcome to the world of annotation wrangling. It's never pretty! But at least the result is much nicer looking. Examine the results using more or less or head:
cat sc_genes.bed | head -20 |
Doesn't this look better?
chrI 334 649 YAL069W 315 + 10 Dubious chrI 537 792 YAL068W-A 255 + 10 Dubious chrI 1806 2169 YAL068C 363 - PAU8 Verified chrI 2479 2707 YAL067W-A 228 + 10 Uncharacterized chrI 7234 9016 YAL067C 1782 - SEO1 Verified chrI 10090 10399 YAL066W 309 + 10 Dubious chrI 11564 11951 YAL065C 387 - 10 Uncharacterized chrI 12045 12426 YAL064W-B 381 + 10 Uncharacterized chrI 13362 13743 YAL064C-A 381 - 10 Uncharacterized chrI 21565 21850 YAL064W 285 + 10 Verified chrI 22394 22685 YAL063C-A 291 - 10 Uncharacterized chrI 23999 27968 YAL063C 3969 - FLO9 Verified chrI 31566 32940 YAL062W 1374 + GDH3 Verified chrI 33447 34701 YAL061W 1254 + BDH2 Uncharacterized chrI 35154 36303 YAL060W 1149 + BDH1 Verified chrI 36495 36918 YAL059C-A 423 - 10 Dubious chrI 36508 37147 YAL059W 639 + ECM1 Verified chrI 37463 38972 YAL058W 1509 + CNE1 Verified chrI 38695 39046 YAL056C-A 351 - 10 Dubious chrI 39258 41901 YAL056W 2643 + GPB2 Verified |
By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but it can be made specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.
RNA-seq libraries can be constructed with 3 types of strandedness:
Which type of RNA-seq library you have depends on the library preparation method – so ask your sequencing center! Our yeast RNA-seq library is sense stranded.
If you have a stranded RNA-seq library, you should use either -s or -S to avoid false counting against a gene on the wrong strand.
We're now (finally!) actually going to do some gene-based analyses of a yeast RNA-seq dataset using bedtools and the BED-formatted yeast gene annotation file we created above.
First start an idev session, since we will be doing some significant computation.
idev -p development -m 20 -A UT-2015-05-18 -N 1 -n 24 --reservation=CCBB |
Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).
mkdir -p $SCRATCH/core_ngs/bedtools cd $SCRATCH/core_ngs/bedtools cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* . cp $CORENGS/yeast_rna/sc_genes.bed* . |
Exercises: How many reads pairs are in the yeast_mrna.sort.filt.bam file? How many proper pairs? How many duplicates? What is the distribution of mapping qualities? What is the average mapping quality?
samtools flagstat for the different read counts. samtools view + cut + sort + uniq -c for mapping quality distribution samtools view + awk for average mapping quality |
There are 3323242 total reads, all mapped and all properly paired. So this must be a quality-filtered BAM. There are 922114 duplicates, or about 28%. To get the distribution of mapping qualities:
To compute average mapping quality:
Mapping qualities range from 20 to 60 – excellent quality! Because the majority reads have mapping quality 60, the average is 59. So again, there must have been quality filtering performed on upstream alignment records. |
In this section we'll use bedtools multicov to count RNA-seq reads that overlap our gene features. The bedtools multicov command (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html) takes a feature file (GFF/BED/VCF) and counts how many reads from one or more input BAM files overlap those feature. The input BAM file(s) must be position-sorted and indexed.
So down to business! Here's how to run bedtools multicov, directing the standard output to a file:
module load bedtools bedtools multicov -s -bams yeast_mrna.sort.filt.bam -bed sc_genes.gff > yeast_mrna_gene_counts.bed |
Exercise: How may records of output were written? Where is the count of overlaps per output record?
xxx |
x
BEDTools is a great utility for comparing genomic features in BAM, BED, VCF, and GFF formats. The documentation is well written in great detail. Here, we will practice three commonly used sub-commands: multicov, merge, and intersect.
We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:
bedtools bedtools multicov |
The multicov command takes a feature file (GFF/BED/VCF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s
option.
Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if BAM file used for mapping contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.
In order to use the bedtools command on our data, do the following:
bedtools multicov -bams yeast_chip_sort.bam yeast_chip2_sort.bam -bed sc_gene.bed > gene_counts.txt |
Then take a peek at the data...
head gene_counts.txt |