Page History

...

Sub-command	Description	Use case(s)
bamtobed	Convert BAM files to BED format.	You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastq	Extract FASTQ sequences from BAM alignment records.	You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg18 hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfasta	Get FASTA entries corresponding to regions.	You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest. In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
coverage	Compute genome-wide coverage of your regions ; generate Generate per-base genome-wide signal trace .	You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth. Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file. Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicov	Count overlaps between one or more BAM files and a set of regions of interest.	Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
merge	Combine a set of possibly-overlapping regions into a single set of non-overlapping regions.	Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtract	Remove unwanted regions.	Remove rRNA gene regions from a merged gene annotations file before counting.
intersect	Determine the overlap between two sets of regions.	Similar to multicov, but takes BED files as input and can also report (not just count) the overlapping regions.
closest	Find the genomic features nearest to a set of regions.	For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

...

Expand

title	Setup (if needed)

Code Block

language	bash

idev -p development -m 120 -A UT-2015-05-18 -N 1 -n 2468 --reservation=BIO_DATA_week_1
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes.bed* .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .

...

Expand

title	Answer

Code Block

language	bash

grep -v 'Dubious' yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($9 > 0) { sum = sum + $9; tot = tot + 1 }}
 END{print sum,"printf("%d overlapping reads in",tot," %d non-Dubious genes\n", sum, tot) }'

There are 1093140 overlapping reads in 5578 non-Dubious genes

...

The output from bedtools merge always starts with either 3 or 4 columns: chrom, start and end of the merged region only

...

.

Using the -c (column) and -o (operation) options, you can have information added in subsequent fields. Each comma-separated column number following -c specifies a column to operate on, and the corresponding comma-separated function name following the -o specifies the operation to perform on that column in order to produce an additional output field.

For example, our sc_genes.bed file has a gene name in column 4, and for each (possibly merged) gene region, we want to know the number of gene regions that were collapsed into the region, and also which gene names were collapsed.

We can do this with -c 6,4,4 -o distinct,count,collapse, which says that two three custom output columns should be added:

the 1st custom column should result from counting the gene names in column 4 for all genes that were merged, andcollapsing distinct (unique) values of gene file column 6 (the strand, + or -)
- since we will ask for stranded merging, the merged regions will always be on the same strand, so this value will always be + or -
the 2nd custom output column should result from counting the gene names in column 4 for all genes that were merged, and
the 3rd custom output the 2nd should be a comma-separated collapsed list of those same column 4 gene names

...

Expand

title	Setup (if needed)

Code Block

language	bash

mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rna/*.gff .
cp $CORENGS/yeast_rna/sc_genes.bed* .
cp $CORENGS/yeast_rna/yeast_mrna.sort.filt.bam* .
module load biocontainers
module load bedtools

Code Block

language	bash
title	Use bedtools merge to collapse overlapping gene annotations

cd $SCRATCH/core_ngs/bedtools
sort -k1,1 -k2,2n sc_genes.bed > sc_genes.sorted.bed
bedtools merge -i sc_genes.sorted.bed -s -c 6,4,4 -o distinct,count,collapse > merged.sc_genes.txt

The first few lines of the merged.sc_genes.txt file look like this (I've tidied it up a bit):

Code Block

2-micron        251     1523    +       1       R0010W
2-micron        1886    3008    -       1       R0020C
2-micron        3270    3816    +       1       R0030W
2-micron        5307    6198    -       1       R0040C
chrI            334     792     +       2       YAL069W,YAL068W-A
chrI            1806    2169    -       1       YAL068C
chrI            2479    2707    +       1       YAL067W-A
chrI            7234    9016    -       1       YAL067C
chrI            10090   10399   +       1       YAL066W
chrI            11564   11951   -       1       YAL065C

Output column 4 has the region's strand(since we asked for a strand-specific merge). Column 5 is the count of merged regions, and column 6 is a comma-separated list of the merged gene names.

...

Expand

title	Answer

Code Block

language	bash

wc -l sc_genes.bed merged.sc_genes.txt

There were 6485 6607 genes before merging and 6485 after.

...

Expand

title	Answer

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools
grep -v 'Dubious' sc_genes.bed > good.sc_genes.bed

sort -k1,1 -k2,2n good.sc_genes.bed > good.sc_genes.sorted.bed
bedtools merge -i good.sc_genes.sorted.bed -s \
  -c 6,4,4 -o distinct,count,collapse > merged.good.sc_genes.txt

wc -l good.sc_genes.bed merged.good.sc_genes.txt

There were 5797 "good" (non-Dubious) genes before merging and 5770 after.

Code Block

language	bash

cut -f 5 merged.good.sc_genes.txt | sort | uniq -c | sort -k2,2n

Produces this histogram:

Code Block

language	bash

Now there are only 20 regions where more than one gene was collapsed. Clearly eliminating the Dubious ORFs helped.

...

Page tree

Versions Compared

Old Version 55

New Version 56

Key