Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Building a reference index involves taking a FASTA file as input, with each chromosome (or contig) as a separate FASTA entry, and producing an aligner-specific set of files as output. Those output index files are then used to perform the sequence alignment, and alignments are reported using coordinates referencing names and offset positions based on the original FASTA file contig entries.

hg19 is way too big for us to index here so we will use an existing set of BWA hg19 index files located at:

Code Block
languagebash
titleBWA hg19 index location
/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19
Tip

which the BioITeam maintains in its /work/projects/BioITeam/ref_genome area,

We can index the references for the yeast genome, the human miRNAs, and the V. cholerae genome, because they are all tiny compared to the human genomesmall, so we'll grab the FASTA files for yeast and human miRNAs references and build each index right before we use them. We will also obtain the special GenBank file that contains both the V. cholerae genome sequence and annotations (a .gbk file). These FASTA files, which you staged above, are:

Code Block
languagebash
titleReference FASTA locations
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/sacCer3.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/hairpin_cDNA_hsa.fa
/work/projects/BioITeam/projects/courses/Core_NGS_Tools/references/vibCho.O395.gbk

hg19 is way too big for us to index here so we will use an existing set of BWA hg19 index files located at:

Code Block
languagebash
titleBWA hg19 index location
/work/projects/BioITeam/ref_genome/bwa/bwtsw/hg19
Tip

The BioITeam maintains a set of reference indexes for many common organisms and aligners. They can be found in aligner-specific sub-directories of the /work/projects/BioITeam/ref_genome area. E.g.:

Code Block
languagebash
/work/projects/BioITeam/ref_genome/
  bowtie2/
  bwa/
  hisat2/
  kallisto/
  star/
  tophat/

Exercise #1: BWA global alignment – Yeast ChIP-seq

...

It is often useful to know what chromosomes/contigs are in a FASTA file before you start an alignment so that you're familiar with the contig naming convention – and to verify that it's the one you expect.  For example, chromosome 1 is specified differently in different references and organisms: chr1 (USCS human), chrI (UCSC yeast), or just 1 (Ensembl GRCh38 human GRCh37).

A FASTA file We saw that a FASTA consists of a number of contig name entries, each one starting with a name line of the form belowright carat ( > ) character, followed by many lines of bases.base characters. E.g.:

Code Block
>contigName>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
GCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTAC
CCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTT

How do we dig out just the lines that have the contig names and ignore all the sequences? Well, the contig name lines all follow the pattern above, and since the > character is not a valid base, it will never appear on a sequence line.

We've discovered a pattern (also known as a regular expression) to use in searching, and the command line tool that does regular expression matching is grep (general regular expression parser). Read more about grep here: Advanced commands: grep.

Regular expressions are so powerful that nearly every modern computer language includes a "regex" module of some sort. There are many online tutorials for regular expressions (, and a few several slightly different "flavors" of them). But the most common is the Perl style (http://perldoc.perl.org/perlretut.html), which was one of the fist and still the most powerful (there's a reason Perl Perl used extensively when assembling the human genome). We're only going to use the most simple of regular expressions here, but learning more about them will pay handsome dividends for you in the future (there's a reason Perl was used a lot when assembling the human genome).

Here's how to execute grep to list contig names in a FASTA file.

Code Block
languagebash
titlegrep to match contig names in a FASTA file
grep -P '^>' sacCer3.fa | more

Notes:

  • The -P option tells grep to use Perl-style regular expression patterns.
    • This makes including special characters like Tab ( \t ), Carriage Return ( \r ) or Linefeed ( \n ) much easier that the default Posix POSIX paterns.
    • While it is not really required here, it generally doesn't hurt to include this option.
  • '^>' is the regular expression describing the pattern we're looking for (described below)

  • sacCer3.fa is the file to search. Lines
    • lines with text that match our pattern will be written to standard output
    ;
    • non matching lines will be omitted
    .
  • We pipe to more just in case there are a lot of contig names.

Now down to the nuts and bolts of our the pattern, : '^>'

First, the single quotes around the pattern – they are only a signal for the bash shell. As part of its friendly command line parsing and evaluation, the shell will often look for special characters on the command line that mean something to it (for example, the $ in front of an environment variable name, like in $SCRATCH). Well, regular expressions treat the $ specially too – but in a completely different way! Those single quotes tell the shell "don't look inside here for special characters – treat this as a literal string and pass it to the program". The shell will obey, will strip the single quotes off the string, and will pass the actual pattern, ^>, to the grep program. (Aside: We've see that the shell does look inside double quotes ( " ) for certain special signals, such as looking for environment variable names to evaluate.)

...