Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

Warning

When following along here, please start an idev session for running any example commands:

Code Block

idev -m 60 -q development

Table of Contents

Illumina sequence data format (FASTQ)

...

Code Block
titleA four-line FASTQ file entry representing one sequence

@HWI-ST1097:104:D13TNACXX:4:1101:1715:2142 1:N:0:CGATGT
GCGTTGGTGGCATAGTGGTGAGCATAGCTGCCTTCCAAGCAGTTATGGGAG
+
=<@BDDD=A;+2C9F<CB?;CGGA<<ACEE*1?C:D>DE=FC*0BAG?DB6

...

Expand
titleAnswer
Code Block

head /corral-repl/utexas/BioITeam/ngs_course/intro_to_mapping/data/SRR030257_1.fastq

Executing the command above reports that the 2nd sequence has ID = @SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC

...

Code Block
titleUsing wc -l to count lines

wc -l $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq

...

Code Block
titleUsing wc -l on a compressed file

gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | wc -l

...

Expand
titleHow do I do math on the command line?

The bash shell has a really strange syntax for arithmetic: it uses a double-parenthesis operator. Go figure.

Code Block
titleArithmetic in Bash

echo $((2368720 / 4))

FASTQ Quality Assurance tools

...

Code Block
titleRunning FastQC example

# setup
cds
mkdir fastqc_test
cd fastqc_test
cp $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz .

# running the program
$BI/bin/FastQC/fastqc Sample_Yeast_L005_R1.cat.fastq.gz

...

Expand
titleAnswer
No Format
titlels -l shows something like this

drwxrwxr-x 4 abattenh G-803889     4096 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc
-rw-rw-r-- 1 abattenh G-803889   198239 May 20 22:59 Sample_Yeast_L005_R1.cat_fastqc.zip
-rwxr-xr-x 1 abattenh G-803889 51065629 May 20 22:59 Sample_Yeast_L005_R1.cat.fastq.gz

The Sample_Yeast_L005_R1.cat.fastq.gz file is what we analyzed, so FastQC created the other two items. Sample_Yeast_L005_R1.cat_fastqc is a directory (the "d" in "drwxrwxr-x"), so use ls Sample_Yeast_L005_R1.cat_fastqc to see what's in it. Sample_Yeast_L005_R1.cat_fastqc.zip is just a Zipped (compressed) version of the whole directory.

...

Code Block
titleFastQC results URL

http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/Sample_Yeast_L005_R1.cat_fastqc/fastqc_report.html

...

Code Block
titleRunning samstat on FASTQ example

# setup
cds
mkdir samstat_test
cd samstat_test
cp $BI/ngs_course/intro_to_mapping/data/SRR030257_1.fastq .

# run the program
$BI/bin/samstat SRR030257_1.fastq

...

Code Block
titleURL for viewing samstat results

http://loving.corral.tacc.utexas.edu/bioiteam/SRR030257_1.fastq.html

...

Code Block
titleRunning samstat on BAM example

# setup
cds
mkdir samstat_test2
cd samstat_test2
cp $BI/web/yeast_stuff/yeast_chip_sort.bam .

# run the program
$BI/bin/samstat yeast_chip_sort.bam

...

Code Block
titleURL for viewing samstat results

http://loving.corral.tacc.utexas.edu/bioiteam/yeast_stuff/yeast_chip_sort.bam.html

...

Code Block
titleFASTX_toolkit module description

module spider fastx_toolkit
module load fastx_toolkit

...

No Format
titlefastx_trimmer example

gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 > trimmed.fq

...

Expand
titleAnswer

You could supply the -z option like this:

No Format
titlefastx_trimmer example

gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 -z > trimmed.fq.gz

Or you could gzip the output yourself:

No Format
titlefastx_trimmer example

gunzip -c $BI/web/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz | fastx_trimmer -l 50 -Q 33 | gzip > trimmed.fq.gz

...

Expand
titleHint

Type fastx_ then tab to see their names
See all the programs like this:

Code Block
titlefastx toolkit programs

ls $TACC_FASTX_BIN

Adapter trimming

...

Code Block
titlecutadapt command for R1 sequences

cutadapt -m 22 -O 10 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Code Block
titlecutadapt command for R2 sequences

cutadapt -m 22 -O 10 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA

...

Expand
titleThe gory details on the *-a* adapter sequence argument

Please refer to https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors for Illumina library adapter layout.

The top strand, 5' to 3', of a read sequence looks like this.

No Format
titleIllumina library read layout

<P5 capture> <indexRead2> <Read 1 primer> [insert] <Read 2 primer> <indexRead1> <P7 capture>

The -a argument to cutadapt is documented as the "sequence of adapter that was ligated to the 3' end". So we care about the <Read 2 primer> for R1 reads, and the <Read 1 primer> for R2 reads.

The "contaminent" for adapter trimming will be the <Read 2 primer> for R1 reads. There is only one Read 2 primer:

Code Block
titleRead 2 primer, 5' to 3', used as R1 sequence adapter

AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

The "contaminent" for adapter trimming will be the <Read 1 primer> for R2 reads. However, there are three different Read 1 primers, depending on library construction:

No Format
titleRead 1 primer depends on library construction

TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # small RNA sequencing primer site
CAGGTTCAGAGTTCTACAGTCCGACGATCA        # "other"
TCTACACTCTTTCCCTACACGACGCTCTTCCGATCT  # TruSeq Read 1 primer site. This is the RC of the R2 adapter

Since R2 reads are the reverse complement of R1 reads, the R2 adapter contaminent will be the RC of the Read 1 primer used.

For ChIP-seq libraries where reads come from both DNA strands, the TruSeq Read 1 primer is always used.
Since it is the RC of the Read 2 primer, its RC is just the Read 1 primer back
Therefore, for ChIP-seq libraries only one cutadapt command is needed:

Code Block
titleCutadapt adapter sequence for ChIP-seq lib
raries, both R1 and R2 reads}
cutadapt -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

For RNAseq libraries, we use the small RNA sequencing primer as the Read 1 primer.
The contaminent is then the RC of this, minus the 1st and last bases:

No Format
titleSmall RNA library Read 1 primer, 5' to 3', used as R2 sequence adapter

TCTACACGTTCAGAGTTCTACAGTCCGACGATCA    # R1 primer - small RNA sequencing Read 1 primer site, 5' to 3'
TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA    # R2 adapter contaminent (RC of R1 small RNA sequencing Read 1 primer)

...

No Format
titleExample of trimming adaptor sequence from right end of reads

flexbar -n 1 --adapters adaptors.fna --source example.fastq --target example.ar --format fastq-sanger --adapter-threshold 2 --adapter-min-overlap 6 --adapter-trim-end RIGHT_TAIL
Code Block
titleExample adaptors.fna file

>adaptor1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>adaptor2
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>adaptor1_RC
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>adaptor2_RC
CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

Note that flexbar only searches for the exact sequences given (with options to allow for a given number of mismatches) not the reverse complement of those sequences therefore you must provide them yourself.

Trimmomatic

Trimmomatic offers similar options with the potential benefit that many illumina adaptor sequences are already "built-in". It is available here.