Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Tip
titleReservations

Use our summer school reservation (CoreNGSday5CoreNGS-Fri) when submitting batch jobs to get higher priority on the ls6 normal queue today:

sbatch --reservation=CoreNGSday5 CoreNGS-Fri <batch_file>.slurm
idev -m 180 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS-Fri

Table of Contents

The BED format

...

Sub-commandDescriptionUse case(s)
bamtobedConvert BAM files to BED format.You want to have the contig, start, end, and strand information for each mapped alignment record in separate fields. Recall that the strand is encoded in a BAM flag (0x10) and the exact end coordinate requires parsing the CIGAR string.
bamtofastqExtract FASTQ sequences from BAM alignment records.You have downloaded a BAM file from a public database, but it was not aligned against the reference version you want to use (e.g. it is hg19 and you want an hg38 alignment). To re-process, you need to start with the original FASTQ sequences.
getfastaGet FASTA entries corresponding to regions.You want to run motif analysis, which requires the original FASTA sequences, on a set of regions of interest.  In addition to the BAM file, you must provide FASTA file(s) for the genome/reference used for alignment (e.g. the FASTA file used to build the aligner index).
genomecov
  • Generate per-base
coverage
  • Compute genome-wide coverage of your regionsGenerate signal trace
  • Produce a per-base genome-wide signaltrace(in bedGraph format), for example for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be visualized in the Broad's IGV (Integrative Genome Browser) application, or configured in the UCSC Genome Browser as custom tracks.
coverage
  • Compute coverage of your regions
  • You have performed a WGS (whole genome sequencing) experiment and want to know if has resulted in the desired coverage depth.
  • Calculate what proportion of the (known) transcriptome is covered by your RNA-seq alignments. Provide the transcript regions as a BED or GFF/GTF file.
  • Produce a per-base genome-wide signal (in bedGraph format) for a ChIP-seq or ATAC-seq experiment. After conversion to binary bigWig format, such tracks can be configured in the UCSC Genome Browser as custom tracks.
multicovCount overlaps between multicovCount overlaps between one or more BAM files and a set of regions of interest.
  • Count RNA-seq alignments that overlap a set of genes of interest. While this task is usually done with a specialized RNA-seq quantification tool (e.g. featureCounts or HTSeq), bedtools multicov can provide a quick estimate, e.g. for QC purposes.
mergeCombine a set of possibly-overlapping regions into a single set of non-overlapping regions.Collapse overlapping gene annotations into per-strand non-overlapping regions before counting (e.g with featureCounts or HTSeq). If this is not done, the source regions will potentially be counted multiple times, once for each (overlapping) target region it intersects.
subtractRemove unwanted regions.Remove rRNA gene regions from a merged gene annotations file before counting.
intersectDetermine the overlap between two sets of regions.Similar to multicov, but can also report (the overlapping regions, not just count ) the overlapping regionsthem.
closestFind the genomic features nearest to a set of regions.For a set of significant ChIP-seq transcription factor (TF) binding regions ("peaks") that have been identified, determine nearby genes that may be targets of TF regulation.

...

Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS-Fri
# 
# ...or
idev -m 90 -N 1 -A OTH21164 -p development

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1

...

  • Most BEDTools functions now accept either BAM or BED files as input. 
    • BED format files must be BED3+, or BED6+ if strand-specific operations are requested.
  • When comparing against a set of regions, those regions are usually supplied in either BED or GTF/GFF.
  • All text-format input files (BED, GTF/GFF, VCF) should use Unix line endings (linefeed only).

The most important thing to remember about comparing regions using BEDTools, is that all input files must share the same set of contig names and be based on the same reference! For example, if an alignment was performed against a human GRCh38 reference genome from Gencode, use annotations from the corresponding GFF/GTF annotations.

...

By default many bedtools utilities that perform overlapping, consider reads overlapping the feature on either strand, but can be made strand-specific with the -s or -S option. This strandedness options for bedtools utilities refers the orientation of the R1 read with respect to the feature's (gene's) strand.

  • -s says the R1 read is sense stranded (on the same strand as the gene).
  • -S says the R1 read is antisense stranded (the opposite strand as the gene).

...

  1. seqname - The name of the chromosome or scaffoldcontig.
  2. source - Name of the program that generated this feature, or other data source (e.g. database)
  3. feature_type - Type of the feature. Examples of common feature types include:, for example:
    • CDS (
    • Some examples of common feature types are:CDS (coding sequence), exon
    • gene, transcript
    • start_codon, stop_codon
  4. start - Start position of the feature, with sequence numbering starting at 1.
  5. end - End position of the feature, with sequence numbering starting at 1.
  6. score - A numeric value. Often but not always an integer.
  7. strand - Defined as + (forward), - (reverse), or . (no relevant strand)
  8. frame - For a CDS, one of 0, 1 or 2, specifying the reading frame of the first base; otherwise '.'

...

Expand
titleMake sure you're in an idev session


Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS-Fri
# ...or
idev -m 90 -N 1 -A OTH21164 -p development

module load biocontainers
module load bedtools
bedtools --version   # should be bedtools v2.27.1


...

Code Block
languagebash
titleLook at GFF annotation entries with less
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools 
cp $CORENGS/yeast_rnaseq/sacCer3.R64-1-1_20110208.gffyeast_mrna.sort.filt.bam* .  
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 

# Use the less pager to look at multiple lines
less sacCer3.R64-1-1_20110208.gff

# Look at just the most-important Tab-separated columns
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3-5 | head -20

# Include the ugly 9th column where attributes are stored
cat sacCer3.R64-1-1_20110208.gff | grep -v '#' | cut -f 1,3,9 | head

...

One of the first things you want to know about your annotation file is what gene features it contains. Here's how to find that: (Read more about what's going on here at Piping piping a histogram.)

Expand
titleSetup (if needed)


Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/catchup/references/gff/sacCer3.R64-1-1_20110208.gff . 



Code Block
languagebash
titleCreate a histogram of all the feature types in a GFF
cd $SCRATCH/core_ngs/bedtools
cat sacCer3.R64-1-1_20110208.gff | grep -v '^#' | cut -f 3 | \
  sort | uniq -c | sort -k1,1nr | more

...

Expand
titleSetup (if needed)


Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeast_rnaseq/*.gff .catchup/references/gff/sacCer3.R64-1-1_20110208.gff .  



Code Block
languagebash
titleConvert GFF to BED with BioITeam script
/work/projects/BioITeam/common/script/gtf_to_bed.pl sc_genes.gff 1 \
  > sc_genes.converted.bed

...

Expand
titleSetup (if needed)


Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeastcatchup/bedtools_rnaseqmerge/*.gff .  
cp $CORENGS/catchup/yeastbedtools_rnaseqmerge/sc_genes.converted.bed .



Code Block
languagebash
titleRe-order the final BED fields
head -1 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}{print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed.hdr

tail -n +2 sc_genes.converted.bed | sed 's/\r//' | awk '
 BEGIN{FS=OFS="\t"}
 { if($15 == "") {$15 = $10} # make sure gene name is populated
   print $1,$2,$3,$10,$5,$6,$15,$16}
 ' > sc_genes.bed

...

Expand
titleSetup (if needed)


Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/bedtools
cd $SCRATCH/core_ngs/bedtools
cp $CORENGS/yeastcatchup/bedtools_rnaseqmerge/*.gff .  
cp $CORENGS/catchup/yeastbedtools_rnaseqmerge/sc_genes* .


Exercise: How many genes in our sc_genes.bed file are in each category?

...

Code Block
languagebash
titleStart an idev session
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5CoreNGS-Fri
# or
idev -m 90 -N 1 -A OTH21164 -p development

Copy over the yeast RNA-seq files we'll need (also copy the GFF gene annotation file if you didn't make one).

...

Expand
titleSetup (if needed)


Code Block
languagebash
idev -m 120 -N 1 -A OTH21164 -r CoreNGSday5
module load biocontainers
module load samtools
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_multicov
cd $SCRATCH/core_ngs/bedtools_multicov
cp $CORENGS/catchup/bedtools_merge/merged*bed      $SCRATCH/core_ngs/bedtools_multicov/.
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* $SCRATCH/core_ngs/bedtools_multicov/.



Code Block
languagebash
titleRun bedtools multicov to count BAM alignments overlapping a set of genes
cd $SCRATCH/core_ngs/bedtools_multicov
bedtools multicov -s -bams yeast_mrna.sort.filt.bam \
  -bed merged.good.sc_genes.bed > yeast_mrna_gene_counts.bed

...

Expand
titleAnswer


Code Block
languagebash
cat yeast_mrna_gene_counts.bed | awk '
 BEGIN{FS="\t";sum=0;tot=0}
 {if($7 > 0) { sum = sum + $7; tot = tot + 1 }}
 END{printf("%d overlapping reads in %d genes\n", sum, tot) }'

There are 1,152,831 overlapping reads in 6,141 non-0 gene annotations.

Use bedtools genomecov to create a signal track

A signal track is a bedGraph (BED3+) file with an entry for every base in a defined set of regions that shows the count of overlapping bases for the regions (see https://genome.ucsc.edu/goldenpath/help/bedgraph.html). bedGraph files can be visualized in the Broad's IGV (Integrative Genomics Viewer) application (https://software.broadinstitute.org/software/igv/download) or in the UCSC Genome Browser (https://genome.ucsc.edu/).

  • Go to the UCSC Genome Browser: https://genome.ucsc.edu/
  • Select Genomes from the top menu bar
  • Select Human from POPULAR SPECIES
    • under Human Assembly select Feb 2009 (GrCh37/hg19)
    • select GO
  • In the hg19 browser page, the Layered H3K27Ac track is a signal track
    • the x-axis is the genome position
    • the y-axis represents the count of ChIP-seq reads that overlap each position
      • where the ChIP'd protein is H3K27AC (histone H3, acetylated on the Lysine at amino acid position 27)

The bedtools genomecov function (https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html), with the -bg (bedgraph) option produces output in bedGraph format. Here we'll analyze the per-base coverage of yeast RNAseq reads in our merged yeast gene regions.

Make sure you're in an idev session, then prepare a directory for this exercise.

Code Block
languagebash
titlePrepare for bedtools coverage
idev -m 120 -N 1 -A OTH21164 -r CoreNGS-day5
# or
idev -m 90 -N 1 -A OTH21164 -p development

module load biocontainers
module load bedtools

mkdir -p $SCRATCH/core_ngs/bedtools_genomecov
cd $SCRATCH/core_ngs/bedtools_genomecov 
cp $CORENGS/catchup/bedtools_merge/merged*bed .
cp $CORENGS/yeast_rnaseq/yeast_mrna.sort.filt.bam* .

Then calling bedtools genomecov is easy. The -bg option says to report the depth in bedGraph format.

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools_genomecov
bedtools genomecov -bg -ibam yeast_mrna.sort.filt.bam > yeast_mrna.genomecov.bedGraph

wc -l yeast_mrna.genomecov.bedGraph # 1519274 lines

The bedGraph (BED3+) format has only 4 columns: chrom start end value and does not need to include positions with 0 reads. Here the count is the number of reads covering each base in the region given by chrom start end, as you can see looking at the first few lines with head:

Code Block
chrI    4348    4390    2
chrI    4390    4391    1
chrI    4745    4798    2
chrI    4798    4799    1
chrI    4949    4957    2
chrI    4957    4984    4
chrI    4984    4997    6
chrI    4997    4998    5
chrI    4998    5005    4
chrI    5005    5044    2
chrI    5044    5045    1
chrI    6211    6268    2
chrI    6268    6269    1
chrI    7250    7257    3
chrI    7257    7271    4
chrI    7271    7274    6
chrI    7274    7278    7
chrI    7278    7310    8
chrI    7310    7315    6
chrI    7315    7317    5 

Because this bedGraph file is for the small-ish (12Mb) yeast genome, and for reads that cover only part of that genome, it is not too big – only ~34M. But depending on the species and read depth, bedGraph files can get very large, so there is a coresponding binary format called bigWig (see https://genome.ucsc.edu/goldenpath/help/bigWig.html). The program to covert a bedGraph file to bigWig format is part of the UCSC Tools suite of programs. Look for it with module spider, and note that you can get information about all the tools in it using module spider with a specific container version:

Code Block
# look for the ucsc tools package
module spider ucsc

# specifying a specific container version will show more information about the package
module spider ucsc_tools/ctr-357--0

# displays information including the programs in the package:
  - bedGraphToBigWig
  - bedToBigBed
  - faToTwoBit
  - liftOver
  - my_print_defaults
  - mysql_config
  - nibFrag
  - perror
  - twoBitToFa
  - wigToBigWig

Looking at the help for bedGraphToBigWig, we'll need a file of chromosome sizes. We can create one from our BAM header, using a Perl substitution script, which I prefer to sed (see Tips and tricks#perlpatternsubstitution):

Code Block
languagebash
module load ucsc_tools

cd $SCRATCH/core_ngs/bedtools_genomecov
bedGraphToBigWig  # look at its usage

# create the needed chromosome sizes file from our BAM header
module load samtools
samtools view -H yeast_mrna.sort.filt.bam | grep -P 'SN[:]' | \
  perl -pe 's/.*SN[:]//' | perl -pe 's/LN[:]//' > sc_chrom_sizes.txt

cat sc_chrom_sizes.txt

# displays:
chrI    230218
chrII   813184
chrIII  316620
chrIV   1531933
chrV    576874
chrVI   270161
chrVII  1090940
chrVIII 562643
chrIX   439888
chrX    745751
chrXI   666816
chrXII  1078177
chrXIII 924431
chrXIV  784333
chrXV   1091291
chrXVI  948066
chrM    85779

Finally, call bedGraphToBigWig after sorting the bedGraph file again using the sort format bedGraphToBigWig likes. (You can try calling bedGraphToBigWig without sorting to see the error).

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools_genomecov
export LC_COLLATE=C
sort -k1,1 -k2,2n yeast_mrna.genomecov.bedGraph > yeast_mrna.genomecov.sorted.bedGraph
bedGraphToBigWig yeast_mrna.genomecov.sorted.bedGraph sc_chrom_sizes.txt yeast_mrna.genomecov.bw

See the size difference between the bedGraph and the bigWig files. The bigWig (9.7M) is less that 1/3 the size of the bedGraph (34M).

Code Block
languagebash
cd $SCRATCH/core_ngs/bedtools_genomecov
ls -lh yeast_mrna.genome*

Since the bigWig file is binary, not text, you can't use commands like cat, head, tail on them directly and get meaningful output. Instead, just as zcat converts gzip'd files to text, and samtools view convets binary BAM files to text, the bigWigToBedGraph program can convert binary bigWig format to text. That's a different BioContainers module (ucsc-bigwigtobedgraph) and the default container version doesn't work, so we'll specifically load one that does:

Code Block
languagebash
# The default version of is broken, so load this specific biocontainers version
module load ucsc-bigwigtobedgraph/ctr-357--1

# see usage for bigWigToBedGraph:
bigWigToBedGraph

cd $SCRATCH/core_ngs/bedtools_genomecov
# use the program to view a few lines of the binary bigWig file
bigWigToBedGraph yeast_mrna.genomecov.bw stdout | head