...
Sub-command | Description | Use case(s) |
---|---|---|
bamtobed | Convert BAM files to BED format. | You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string. |
bamtofastq | Extract FASTQ sequences from BAM alignment records. | You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg18 hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences. |
getfasta | Get FASTA entries corresponding to regions. | You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest. In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index). |
coverage |
|
|
multicov | Count overlaps between one or more BAM files and a set of regions of interest. |
|
merge | Combine a set of possibly-overlapping regions into a single set of non-overlapping regions. | Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects. |
subtract | Remove unwanted regions. | Remove rRNA gene regions from a merged gene annotations file before counting. |
intersect | Determine the overlap between two sets of regions. | Similar to multicov, but takes BED files as input and can also report (not just count) the overlapping regions. |
closest | Find the genomic features nearest to a set of regions. | For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation. |
...
Expand | |||||
---|---|---|---|---|---|
| |||||
|
...
Expand | |||||
---|---|---|---|---|---|
| |||||
There are 1093140 overlapping reads in 5578 non-Dubious genes |
...
The output from bedtools merge always starts with either 3 or 4 columns: chrom, start and end of the merged region only
...
.
Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.
For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.
We can do this with -c 6,4,4 -o distinct,count,collapse, which says that two three custom output columns should be added:
- the 1st custom column should result from counting the gene names in column 4 for all genes that were merged, andcollapsing distinct (unique) values of gene file column 6 (the strand, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
- the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
- the 3rd custom output the 2nd should be a comma-separated collapsed list of those same column 4 gene names
...
Expand | |||||
---|---|---|---|---|---|
| |||||
|
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/bedtools sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed bedtools merge -i sc_genes.sorted.bed -s -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt |
The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):
Code Block |
---|
2-micron 251 1523 + 1 R0010W 2-micron 1886 3008 - 1 R0020C 2-micron 3270 3816 + 1 R0030W 2-micron 5307 6198 - 1 R0040C chrI 334 792 + 2 YAL069W,YAL068W-A chrI 1806 2169 - 1 YAL068C chrI 2479 2707 + 1 YAL067W-A chrI 7234 9016 - 1 YAL067C chrI 10090 10399 + 1 YAL066W chrI 11564 11951 - 1 YAL065C |
Output column 4 has the region's strand(since we asked for a strand-specific merge). Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.
...
Expand | |||||
---|---|---|---|---|---|
| |||||
There were 6485 6607 genes before merging and 6485 after. |
...
Expand | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||
There were 5797 "good" (non-Dubious) genes before merging and 5770 after.
Produces this histogram:
Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped. |
...