...

That is a lot to process! For now, we just want to read in a SAM file and output a BAM file. The input format is auto-detected, so we don't need to specify it (although you do in v0.1.19). We just need to tell the tool to output the file in BAM format, and to include the header records.

Expand
titleSetup (if needed)
Code Block
languagebash
titleGet the alignment exercises files
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pairedend.sam .
Code Block
languagebash
titleConvert SAM to binary BAM
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cat yeast_pairedend.sam | samtools view -b -o yeast_pairedend.bam
  • the -b option tells the tool to output BAM format
  • the -o option specifies the name of the output BAM file that will be created
  • we pipe the entire SAM file to samtools view so that the header records are included (required for SAM → BAM conversion)
    • samtools view reads its input from standard input by default
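Because the header records matter for the conversion, it can be worth confirming that a SAM file actually has them before converting. The sketch below is only an illustration: it builds a tiny, made-up SAM file (the read name, reference name and length are hypothetical) and counts header lines, which start with @, separately from alignment records.

```bash
# Build a tiny, hypothetical SAM file: two header lines and one unmapped read.
sam_file=$(mktemp)
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:chrI\tLN:230218\n'
  printf 'readA\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$sam_file"

# Header lines start with @; everything else is an alignment record.
n_header=$(grep -c '^@' "$sam_file")
n_records=$(grep -vc '^@' "$sam_file")
echo "header lines: $n_header, alignment records: $n_records"

rm -f "$sam_file"
```

On a real file, replace the temporary file with yeast_pairedend.sam. A header count of 0 would mean the header records are missing.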

How do you look at the BAM file contents now? That's simple. Just use samtools view without the -b option. Remember to pipe output to a pager!

...

Code Block
titlesamtools sort usage
Usage: samtools sort [options...] [in.bam]
Options:
  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)
  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]
  -n         Sort by read name
  -t TAG     Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
  -o FILE    Write final output to FILE rather than standard output
  -T PREFIX  Write temporary files to PREFIX.nnnn.bam
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]

In most cases you will be sorting a BAM file from name order to locus order. You can use either -o or redirection with > to control the output.
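To make "name order" versus "locus order" concrete, here is a rough illustration using the ordinary Unix sort command on four made-up, simplified records (read name, flag, reference, position). This is not how samtools sort works internally. In particular, samtools orders references as they appear in the BAM header, whereas this sketch simply sorts reference names alphabetically.

```bash
# Four hypothetical records in name order: read name, flag, reference, position.
records=$(printf '%s\t%s\t%s\t%s\n' \
  readA 99  chrII 500 \
  readA 147 chrII 650 \
  readB 99  chrI  1200 \
  readB 147 chrI  1350)

# Locus order: sort by reference name (field 3), then numerically by position (field 4).
by_locus=$(printf '%s\n' "$records" | LC_ALL=C sort -t "$(printf '\t')" -k3,3 -k4,4n)
printf '%s\n' "$by_locus"

# The first record is now the one with the lowest position on the first reference.
first_read=$(printf '%s\n' "$by_locus" | head -n 1 | cut -f 1)
echo "first read in locus order: $first_read"
```

After sorting, the two mates of a read are no longer guaranteed to be adjacent, which is why some tools require name-sorted input instead.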

Expand
titleSetup (if needed)

Copy aligned yeast BAM file

Code Block
languagebash
mkdir -p $SCRATCH/core_ngs/alignment/yeast_bwa
cd $SCRATCH/core_ngs/alignment/yeast_bwa
cp $CORENGS/catchup/yeast_bwa/yeast_pairedend.bam .

To sort the paired-end yeast BAM file by position, and get a BAM file named yeast_pairedend.sort.bam as output, execute the following command:

Code Block
languagebash
titleSort a BAM file
cd $SCRATCH/core_ngs/alignment/yeast_bwa
samtools sort -O bam -T yeast_pairedend.tmp yeast_pairedend.bam > yeast_pairedend.sort.bam
  • The -O option says the output format should be BAM
  • The -T option gives a prefix for temporary files produced during sorting
    • sorting large BAMs will produce many temporary files during processing
  • By default sort writes its output to standard output, so we use > to redirect to a file named yeast_pairedend.sort.bam

...

samtools index

Many tools (like IGV, the Integrative Genomics Viewer) only need to use portions of a BAM file at a given point in time. For example, if you are viewing alignments that are within a particular gene, alignment records on other chromosomes do not need to be loaded. In order to speed up access, BAM files are indexed, producing BAI files which allow fast random access. This is especially important when you have many alignment records.

...

Code Block
titlesamtools index usage
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
  -b       Generate BAI-format index for BAM files [default]
  -c       Generate CSI-format index for BAM files
  -m INT   Set minimum interval size for CSI indices to 2^INT [14]
  -@ INT   Sets the number of threads [none]

The syntax here is way, way easier. We want a BAI-format index, which is the default. (CSI format is used with extremely long contigs, which don't apply here; the most common use case is polyploid plant genomes.)

So all we have to provide is the sorted BAM:

Code Block
languagebash
titleIndex a sorted bam
samtools index yeast_pairedend.sort.bam
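When no output name is given, samtools index writes the BAI index next to the BAM file, with .bai appended to the BAM file name. A small sketch for checking that the index exists before handing the BAM to a viewer (the variable names here are just for illustration):

```bash
bam=yeast_pairedend.sort.bam
bai="${bam}.bai"   # default index name when no output name is given

if [ -f "$bai" ]; then
  echo "index present: $bai"
else
  echo "no index found; create one with: samtools index $bam"
fi
```

Tools that need random access generally expect the index to sit in the same directory as the BAM file it belongs to.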

...

Code Block
titlesamtools flagstat output
1184360 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
547664 + 0 mapped (46.24% : N/A)
1184360 + 0 paired in sequencing
592180 + 0 read1
592180 + 0 read2
473114 + 0 properly paired (39.95% : N/A)
482360 + 0 with itself and mate mapped
65304 + 0 singletons (5.51% : N/A)
534 + 0 with mate mapped to a different chr
227 + 0 with mate mapped to a different chr (mapQ>=5)

...

Expand
titleHint

Divide the number of properly paired reads by the number of mapped reads:

Code Block
languagebash
awk 'BEGIN{ print 473114 / 547664 }'
# or
echo $(( 473114 * 100 / 547664 ))
Expand
titleAnswer

About 86% of mapped reads were properly paired. This is actually a bit on the low side for ChIP-seq alignments, which are typically over 90%.
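Rather than copying the numbers by hand, the same ratio can be pulled out of saved flagstat output with awk. This is a minimal sketch that assumes the flagstat text has been written to a file. A temporary file holding three lines of the output shown earlier stands in for it here, and the field positions assume the line layout shown above.

```bash
# Stand-in for saved flagstat output, e.g. from:
#   samtools flagstat yeast_pairedend.sort.bam > flagstat.txt
flagstat_file=$(mktemp)
cat > "$flagstat_file" <<'EOF'
1184360 + 0 in total (QC-passed reads + QC-failed reads)
547664 + 0 mapped (46.24% : N/A)
473114 + 0 properly paired (39.95% : N/A)
EOF

# Field 1 is the QC-passed count; fields 4 and 5 identify the line.
pct=$(awk '
  $4 == "mapped"                     { mapped = $1 }
  $4 == "properly" && $5 == "paired" { proper = $1 }
  END { printf "%.1f", 100 * proper / mapped }
' "$flagstat_file")
echo "properly paired / mapped = ${pct}%"

rm -f "$flagstat_file"
```

Unlike the shell's integer arithmetic, awk works in floating point, so the result keeps its decimal place.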

...