A sampling of the resources available, selected specifically for this course. It is not a comprehensive catalog.
Linux
Community Resources
Sequencing Technologies
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer "single molecule" sequencing
- Older technologies (less common now)
Fastq analysis/manipulation
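For a feel of what fastq manipulation involves, here is a minimal sketch of reading records in plain Python. It assumes the common four-line record layout (header, sequence, `+` separator, qualities) and Phred+33 quality encoding. Older Illumina pipelines used other offsets, so treat that as an assumption. The function name `read_fastq` is made up for illustration and is not part of any tool listed here.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record.

    Assumes one line each for header, sequence, '+' separator and
    quality string, with Phred+33 encoded qualities.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops iteration
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]


if __name__ == "__main__":
    example = StringIO("@read1\nACGT\n+\nII5!\n")
    for name, seq, quals in read_fastq(example):
        print(name, seq, quals)
```

For real data, a dedicated fastq toolkit is almost always the better choice. The sketch is only meant to make the record structure concrete.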
Reference genomes
Basic alignment and aligners
- Comparison of different aligners
- File formats
- input: fastq format
- output: the SAM (Sequence Alignment Map) format specification (pdf)
- Aligners
- Anna has some TACC-aware alignment scripts you might find useful
- bwa alignment
- /work/01063/abattenh/code/script/align/align_bwa_illumina.sh
- bowtie2 alignment
- /work/01063/abattenh/code/script/align/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
- /work/01063/abattenh/code/script/align/merge_sorted_bams.sh
- email or come talk to me if you have questions or problems
Transcriptome-aware aligners
Alignment analysis
- SAM (Sequence Alignment Map) format specification (pdf)
- samtools – http://samtools.sourceforge.net/ by Heng Li
- sam/bam conversion, flag filtering, sorting, indexing, duplicate filtering
- Picard toolkit – http://broadinstitute.github.io/picard/
- sam/bam utilities that are read-group aware
- especially MarkDuplicates for flagging duplicate alignments
- SAMStat – http://samstat.sourceforge.net/
- produces detailed graphical statistics for sam/bam files
- bedtools – http://bedtools.readthedocs.org/en/latest/
- Swiss army knife for all manner of common bed, bam, vcf, gff file manipulation such as:
- intersecting bam or bed with annotation files
- merging overlapping regions
- generation of per-base genome-wide signal in bedGraph format
- bedtools coverage
- bedtools multicov
- extracting fasta corresponding to regions
- Available in the TACC module system
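To illustrate what "merging overlapping regions" means, the following is a small pure-Python sketch of the idea, not bedtools' own implementation. It assumes intervals are given as (chrom, start, end) tuples using BED-style 0-based, half-open coordinates, and it treats book-ended intervals (one ends exactly where the next starts) as mergeable. The function name `merge_intervals` is hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals.

    Coordinates are assumed to be 0-based, half-open, as in BED files.
    """
    merged = []
    # Sorting by chromosome, then start, puts candidates for merging
    # next to each other.
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous interval: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


if __name__ == "__main__":
    regions = [
        ("chr1", 150, 300),
        ("chr1", 100, 200),
        ("chr1", 500, 600),
        ("chr2", 100, 200),
    ]
    for region in merge_intervals(regions):
        print(*region, sep="\t")
```

On real data the bedtools command is the tool to reach for. The sketch is only meant to make the coordinate logic explicit.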
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- essential reading when converting between formats, which ChIP-seq analysis often requires
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- UCSC file format conversion scripts – useful for converting wig and bed files to and from their binary counterparts (bigWig and bigBed).
- Make sure you download the correct script for your operating system!
- A directory containing these tools is available on Stampede at /work/01063/abattenh/local/UCSC_utilities
- Mason program for simulating second-generation sequencing reads
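Much of the SAM specification listed above comes down to reading the bitwise FLAG field correctly, since flag filtering and duplicate marking both depend on it. The sketch below decodes a flag value into its component bits as defined in the specification. The short labels and the function names `decode_flag` and `is_duplicate` are chosen here for illustration only.

```python
# Bit values are those defined in the SAM specification; the short
# labels are informal names picked for this example.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def is_duplicate(flag):
    """True if the duplicate bit (0x400) is set."""
    return bool(flag & 0x400)


if __name__ == "__main__":
    for flag in (4, 99, 1107):
        print(flag, decode_flag(flag))
```

A flag of 4, for example, has only the unmapped bit set, while any flag that includes 0x400 has been marked as a duplicate.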
UCSC Genome Browser
RNAseq/Transcriptome analysis
Variant calling
Genome Annotation
- GREAT: an analysis tool that takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- for human, mouse, zebrafish
- MEME-suite: a motif identification and discovery tool. Works with most species.
- takes fasta files as input, so filter your bam/bed files down to the regions of interest, then extract the corresponding sequences from the reference genome using bedtools getfasta.
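Because motif tools need sequence rather than coordinates, regions of interest have to be turned into fasta records first. The sketch below shows the coordinate logic in plain Python under some simplifying assumptions: the reference is already in memory as a dict of chromosome name to sequence, regions are BED lines (0-based, half-open), and strand is ignored. The helper names `extract_regions` and `to_fasta` and the `chrom:start-end` header style are illustrative choices.

```python
def extract_regions(genome, bed_lines):
    """Return (name, sequence) pairs for each BED region.

    genome    -- dict mapping chromosome name to its sequence string
    bed_lines -- iterable of BED lines (chrom, start, end, ...);
                 coordinates are 0-based, half-open
    """
    records = []
    for line in bed_lines:
        # Skip blank lines and common BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # Python slicing is also 0-based, half-open, so BED
        # coordinates map directly onto a slice.
        records.append((f"{chrom}:{start}-{end}", genome[chrom][start:end]))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    genome = {"chr1": "ACGTACGTAC"}  # toy reference for illustration
    bed = ["chr1\t2\t6\n", "chr1\t0\t3\n"]
    print(to_fasta(extract_regions(genome, bed)), end="")
```

For whole genomes, an indexed reference and a dedicated tool are far more practical. The point here is only how BED coordinates select bases.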