Page History

...

Sub-command	Description	Use case(s)
bamtobed	Convert BAM files to BED format.	You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastq	Extract FASTQ sequences from BAM alignment records.	You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfasta	Get FASTA entries corresponding to regions.	You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest. In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
genomecov	Generate per-base	coverage	Compute genome-wide coverage of your regionsGenerate signal trace	Produce a per-base genome-wide signaltrace(in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks.
coverage	Compute coverage of your regions	You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth. Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file. Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicov	Count overlaps between one or more BAM files and a set of regions of interest.	Count RNA-seq alignments that overlap a set of	multicov	Count overlaps between one or more BAM files and a set of regions of interest.	Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
merge	Combine a set of possibly-overlapping regions into a single set of non-overlapping regions.	Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtract	Remove unwanted regions.	Remove rRNA gene regions from a merged gene annotations file before counting.
intersect	Determine the overlap between two sets of regions.	Similar to multicov, but can also report (not just count) the overlapping regions.
closest	Find the genomic features nearest to a set of regions.	For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

...

Expand

title	Answer

Code Block

language	bash

cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($7 > 0) { sum = sum + $7; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

There are 1,152,831 overlapping reads in 6,141 non-0 gene annotations.

Use bedtools

...

genomecov to create a signal track

A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

The bedtools coveragegenomecov function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -d (per-base depthbg (bedgraph) option produces output that can be made into a bedGraphin bedGraph format. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.

...

Code Block

language	bash
title	Prepare for bedtools coverage

idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_coveragegenomecov
cp $CORENGS/catchup/bedtools_merge/merged*bed      $SCRATCH/core_ngs/bedtools_coveragegenomecov/
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_coveragegenomecov/

Then calling bedtools coveragegenomecov is easy. The "A" file will be our gene regions, and the "B" file will be the yeast RNAseq reads. We also use the -d (per-base depth) and -s (force "strandedness") options. -bg option says to report the depth in bedGraph format.

Code Block

language	bash

cds; cd $SCRATCH/core_ngs/bedtools_coveragegenomecov
bedtools coveragegenomecov -sbg -d -a merged.good.sc_genes.bed -b ibam yeast_mrna.sort.filt.bam > yeast_mrna.gene_coveragegenomecov.txtbedGraph

wc -l yeast_mrna.gene_coveragegenomecov.txtbedGraph # 8,829,3171519274 lines!

It will complain a bit because our genes file includes the yeast plasmid "2-micron" but the RNAseq BAM doesn't include that contig. We'll ignore that warning.

The bedtools coverage output is a bit strange. It lists each region in the A file, followed by information from the B reads. Here the column order will be gene_chrom gene_start gene_end gene_name gene_score gene_strand offset_in_the_gene_region read_overlap count.

Let's look at coverage for gene YAL067C:

Code Block

language	bash

cat yeast_mrna.gene_coverage.txt | grep -P 'YAL067C' | head -50

Will look like this:

Code Block

chrI    7234    9016    YAL067C 1       -       1       0
chrI    7234    9016    YAL067C 1       -       2       0
chrI    7234    9016    YAL067C 1       -       3       0
chrI    7234    9016    YAL067C 1       -       4       0
chrI    7234    9016    YAL067C 1       -       5       0
chrI    7234    9016    YAL067C 1       -       6       0
chrI    7234    9016    YAL067C 1       -       7       0
chrI    7234    9016    YAL067C 1       -       8       0
chrI    7234    9016    YAL067C 1       -       9       0
chrI    7234    9016    YAL067C 1       -       10      0
chrI    7234    9016    YAL067C 1       -       11      0
chrI    7234    9016    YAL067C 1       -       12      0
chrI    7234    9016    YAL067C 1       -       13      0
chrI    7234    9016    YAL067C 1       -       14      0
chrI    7234    9016    YAL067C 1       -       15      0
chrI    7234    9016    YAL067C 1       -       16      0
chrI    7234    9016    YAL067C 1       -       17      1
chrI    7234    9016    YAL067C 1       -       18      1
chrI    7234    9016    YAL067C 1       -       19      1
chrI    7234    9016    YAL067C 1       -       20      1
chrI    7234    9016    YAL067C 1       -       21      1
chrI    7234    9016    YAL067C 1       -       22      1
chrI    7234    9016    YAL067C 1       -       23      1
chrI    7234    9016    YAL067C 1       -       24      1
chrI    7234    9016    YAL067C 1       -       25      1
chrI    7234    9016    YAL067C 1       -       26      1
chrI    7234    9016    YAL067C 1       -       27      1
chrI    7234    9016    YAL067C 1       -       28      1
chrI    7234    9016    YAL067C 1       -       29      1
chrI    7234    9016    YAL067C 1       -       30      1
chrI    7234    9016    YAL067C 1       -       31      1
chrI    7234    9016    YAL067C 1       -       32      1
chrI    7234    9016    YAL067C 1       -       33      1
chrI    7234    9016    YAL067C 1       -       34      1
chrI    7234    9016    YAL067C 1       -       35      1
chrI    7234    9016    YAL067C 1       -       36      1
chrI    7234    9016    YAL067C 1       -       37      1
chrI    7234    9016    YAL067C 1       -       38      2
chrI    7234    9016    YAL067C 1       -       39      2
chrI    7234    9016    YAL067C 1       -       40      2
chrI    7234    9016    YAL067C 1       -       41      3
chrI    7234    9016    YAL067C 1       -       42      3
chrI    7234    9016    YAL067C 1       -       43      3
chrI    7234    9016    YAL067C 1       -       44      3
chrI    7234    9016    YAL067C 1       -       45      4
chrI    7234    9016    YAL067C 1       -       46      4
chrI    7234    9016    YAL067C 1       -       47      4
chrI    7234    9016    YAL067C 1       -       48      4
chrI    7234    9016    YAL067C 1       -       49      4
chrI    7234    9016    YAL067C 1       -       50      4

A proper bedGraph file has only 4 columns: chrom start end value and does not need to include positions with 0 reads, so we'll convert the bedtools coverage output to bedGraph using awk. We re-sort the output so that plus and minus strand positions are adjacent.

Code Block

language	bash

cat yeast_mrna.gene_coverage.txt | awk '
BEGIN{FS=OFS="\t"}
{if ($8>0) {print $1,$2-1+$7,$2+$7,$8}}' | \
  sort -k1,1 -k2,2n -k3,3n > yeast_mrna.gene_coverage.almost.bedGraph

wc -l yeast_mrna.gene_coverage.almost.bedGraph  # 5,710,186 -- better, but still big

While we probably could consider this file to have bedGraph format, it's preferable to combine adjacent per-base coordinates with the same count into larger regions, e.g.

Code Block

# per-base counts
chrI    7271    7272    2
chrI    7272    7273    2
chrI    7273    7274    2
chrI    7274    7275    3
chrI    7275    7276    3
chrI    7276    7277    3
chrI    7277    7278    3

# corresponding region counts
chrI    7271    7274    6
chrI    7274    7278    12

Here's some awk to do this:

Code Block

language	bash

cat yeast_mrna.gene_coverage.almost.bedGraph | awk '
BEGIN{FS=OFS="\t"; chr=""; start=-1; end=-1; count=0}
{if (chr != $1) { # new contig; finish previous
   if (count > 0) { print chr,start,end,count }
   chr=$1; start=$2; end=$3; count=$4
 } else if (($2==end || $2==end+1) && ($4==count)) { # same or adjacent position with same count
   end=$3;
 } else { # new region on same contig; finish prev
   if (count > 0) { print chr,start,end,count}
   start=$2; end=$3; count=$4
 }
}
END{ # finish last
  if (count > 0) { print chr,start,end,count }
}' > yeast_mrna.gene_coverage.bedGraph

wc -l yeast_mrna.gene_coverage.bedGraph  # 1,048,510 -- much better!

Make sure the total counts match!

Code Block

language	bash

cat yeast_mrna.gene_coverage.txt | awk '
  BEGIN{tot=0}{tot=tot+$8}END{print tot}'          # should be 86703686 
cat yeast_mrna.gene_coverage.almost.bed | awk '
  BEGIN{tot=0}{tot=tot+$4}END{print tot}'          # should also be 86703686 
cat yeast_mrna.gene_coverage.bedGraph | awk '
  BEGIN{tot=0}{tot=tot+$4*($3-$2)}END{print tot}'  # should also be 86703686

Now our yeast_mrna.gene_coverage.bedGraph file is a proper bedGraph, whose first lines look like this:

The bedGraph (BED3+) format has only 4 columns: chrom start end value and does not need to include positions with 0 reads. Here the count is the number of reads covering each base in the region given by chrom start end, as you can see looking at the first few lines with head:

Code Block

chrI    4348    4390    2
chrI    4390    4391    1
chrI    4745    4798    2
chrI    4798    4799    1
chrI    4949    4957    2
chrI    4957    4984    4
chrI    4984    4997    6
chrI    4997    4998    5
chrI    4998    5005    4
chrI    5005    5044    2
chrI    5044    5045    1
chrI    6211    6268    2
chrI    6268    6269    1
chrI    7250    7257    3
chrI    7257    7271    4
chrI    7271    7274    6
chrI    7274    7278    7
chrI    7278    7310    8
chrI    7310    7315    6
chrI    7315    7317    5

Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:

Code Block

# look for the ucsc tools package
module spider ucsc

# specifying a specific container version will show more information about the package
module spider ucsc_tools/ctr-357--0

# displays information including the programs in the package:
  - bedGraphToBigWig
  - bedToBigBed
  - faToTwoBit
  - liftOver
  - my_print_defaults
  - mysql_config
  - nibFrag
  - perror
  - twoBitToFa
  - wigToBigWig

Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed (see Tips and tricks#perlpatternsubstitution):

Code Block

language	bash

module load ucsc_tools

cd $SCRATCH/core_ngs/bedtools_genomecov
bedGraphToBigWig  # look at its usage

# create the needed chromosome sizes file from our BAM header
module load samtools
samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
  perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt

cat sc_chrom_sizes.txt

# displays:
chrI    230218
chrII   813184
chrIII  316620
chrIV   1531933
chrV    576874
chrVI   270161
chrVII  1090940
chrVIII 562643
chrIX   439888
chrX    745751
chrXI   666816
chrXII  1078177
chrXIII 924431
chrXIV  784333
chrXV   1091291
chrXVI  948066
chrM    85779

Finally, call bedGraphToBigWig after sorting the bedGraph file again using the sort format bedGraphToBigWig likes. (You can try calling bedGraphToBigWig without sorting to see the error).

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C; sort -k1,1 -k2,2n -k3,3n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw

Just like zcat converts gzip'd files to text, and samtools view convets binary BAM files to text, the bigWigToBedGraph program can convert binary bigWig format to text. Unfortunately, the ucsc-bigwigtobedgraph BioContainer seems to be broken, so we'll use a version in the BioITeam area instead:

Code Block

language	bash

cd $SCRATCH/core_ngs/bedtools_genomecov
# see usage for bigWigToBedGraph:
/work/projects/BioITeam/common/opt/UCSC_utils.2019_08/bigWigToBedGraph

# use the program to view a few lines of the binary bigWig file
/work/projects/BioITeam/common/opt/UCSC_utils.2019_08/bigWigToBedGraph \
  yeast_mrna.genomecov.bw stdout | head

Code Block

chrI    7250    7271    1
chrI    7271    7274    2
chrI    7274    7278    3
chrI    7278    7310    4
chrI    7310    7317    3
chrI    7317    7349    2
chrI    7349    7353    1
chrI    7500    7556    1
chrI    8851    8891    1
chrI    11919   11951   1

Page tree

Versions Compared

Old Version 77

New Version 78

Key

Use bedtools

genomecov to create a signal track