Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Now you have created the alignment directory, moved into it, and created a subdirectory for our raw fastq files.  We will be using four data sets that consist of five files (since the paired-end data set has two separate files for each of the R1 and R2 reads).  To copy them over, execute something like:

Code Block
languagebash
cd $SCRATCH/core_ngs/alignment/fastq
cp /corral-repl/utexas/BioITeam/core_ngs_tools/alignment/*fastq.gz .

...

  1. Trim the FASTQ sequences down to 50 with fastx_clipper
    • this removes most of any 5' adapter contamination without the fuss of specific adapter trimming w/cutadapt
  2. Prepare the sacCer3 reference index for bwa using bwa index (
    • this is done once, and re-used for later alignments
    )
  3. Perform a global bwa alignment on the R1 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
  4. Perform a global bwa alignment on the R2 reads (bwa aln) producing a BWA-specific binary .sai intermediate file
  5. Perform pairing of the separately aligned reads and report the alignments in SAM format using bwa sampe
  6. Convert the SAM file to a BAM file (samtools view)
  7. Sort the BAM file by genomic location (samtools sort)
  8. Index the BAM file (samtools index)
  9. Gather simple alignment statistics (samtools flagstat and samtools idxstat)

...

Like other tools you've worked with so far, you first need to load bwa using the module system.   Go ahead and do that now, and then enter bwa with no arguments to view the top-level help page (many NGS tools will provide some help when called with no arguments).

Code Block
languagebash
module load bwa
bwa
Expand
titleTop-level BWA help
Code Block
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage:   bwa <command> [options]
Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.

As you can see, bwa include many subcommands sub-commands that perform most of the tasks we are interested in.

...

We might be able to get away with just using this literal alone as our regex, specifying '>' as the command line argument. But for grep, the more specific the pattern, the better. So we constrain where the > can appear on the line. The special carat ( ^ ) character represents "beginning of line". So ^> means "beginning of a line followed by a > character, followed by anything. (Aside: the dollar sign ( $ ) character represents "end of line" in a regex. There are many other special characters, including period ( . ), question mark ( ? ), pipe ( | ), parentheses ( ( ) ), and brackets ( [ ] ), to name the most common.)

...

Create the batch submission script specifying a wayness of 8 2 (8 2 tasks per node) on the normal queue and a time of 1 hour, then submit the job and monitor the queue:

Code Block
languagebash
titleSubmit BWA alignment job
launcher_creator.py -n aln -j aln.cmds -t 01:00:00 -q normal -w 82
sbatch aln.slurm
showq -u

...

Expand
titleAnswer
Recall that these are 100 bp reads and we did not remove adapter contamination. There will be a distribution of fragment sizes – some will be short – and those short fragments may not align without adapter removal (fastx_trimmer or cutadapt).

Exercise

...

#2: Bowtie2 - Vibrio Cholerae RNA-seq

While we have focused on aligning eukaryotic data, the same tools can be used to perform identical functions with prokaryotic data.  The major differences are less about the underlying data and much more about the external/public databases established to store and distribute reference data.  For example, the Illumina iGenome resource provides pre-processed and uniform reference data, designed to be out-of-the-box compatible with aligners like bowtie2 and bwa.  However, the limited number of available species are heavily biased towards model eukaryotes.  If If we wanted to study a prokaryote, the reference data must be downloaded from a resource like GenBank, and processed/indexed similarly to the procedure for mirbase.  However, while  

While the alignment procedure for prokaryotes is broadly analogous, the reference preparation process is somewhat different, and will involve use of a biologically-oriented scripting library called BioPerl.  In this exercise, we will use some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.

Overview of Vibrio cholerae alignment workflow with Bowtie2

Alignment of this prokaryotic data follows the workflow below. Here we will concentrate on steps 1 and 2.

  1. Prepare the vibCho reference index for bowtie2 from GenBank files using BioPerl
  2.  Align reads using bowtie2, producing a SAM file
  3. Convert the SAM file to a BAM file (samtools view) 
  4. Sort the BAM file by genomic location (samtools sort)
  5. Index the BAM file (samtools index)
  6. Gather simple alignment statistics (samtools flagstat and samtools idxstat)

...

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations.  The "Genbank" GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.  Go ahead and look at some of the contents of a Genbank GenBank file with the following commands:

Code Block
languagebash
cd $WORK/core_ngs/references
less vibCho.O395.gbk
grep -A 5 ORIGIN vibCho.O395.gbk

As the 'less' command shows, the file begins with a description of the organism and some source information, and the contains annotations for each bacterial gene.  The 'The grep' command shows that, indeed, there is sequence information here (flagged by the word ORIGIN) that could be exported into a FASTA file.  There There are a couple ways of extracting the information we want, namely the reference genome and the gene annotation information, but a convenient one (that is available through the module system at TACC) is BioPerl.

We load BioPerl like we have loaded other modules, with the caveat that we must load regular Perl before loading BioPerl:

Code Block
languagebash
module load perl
module load bioperl

These commands make several scripts directly available to you.  The The one we will use is called bp_seqconvert.pl, and it is a BioPerl script used to interconvert inter-convert file formats like FASTA, GBK, and others.  We will use this script to get both This script produces two output files:

  • a FASTA format file for indexing and alignment

...

  • GFF file (standing for General Feature Format)

...

  • contains information about all genes (or, more generally, features) in the genome
    • remember, annotations such as GFFs must always match the reference you are using

To .  To see how to use the script, just execute:

Code Block
languagebash
bp_seqconvert.pl

Clearly, there are many file formats that we can use this script to convert.  In our case, we are moving "from" genbank "to" fasta, so the commands we would execute to produce and view the FASTA files would look like this:

...