Your Instructors
Most of us are members (or alumni) of the functional genomics lab of Vishwanath Iyer, UT Austin.
- Anna Battenhouse, Associate Research Scientist, Iyer Lab, abattenhouse@utexas.edu
- BA English literature, 1978
- Commercial software development 1982 – 2005
- Joined Iyer Lab 2007 (“retirement career”)
- BS Biochemistry, 2013
- Amelia Weber Hall, Graduate Student, Iyer Lab, ameliahall@utexas.edu
- 5th year Microbiology graduate student
- Laboratory Technician at UT 2007-2010
- BS Molecular Genetics, 2007
- Nathan Abell, Research Assistant, Xhemalce Lab, abell.nathan@gmail.com
- Undergraduate researcher in Iyer Lab 2011-2013
- BS Molecular Biology, UT, 2013
- Research Assistant
- Dakota Derryberry, Graduate Student, Wilke Lab, dakotaz@utexas.edu
- ???
About the Iyer Lab
- Main focus is functional genomics
- large-scale transcriptional reprogramming in response to diverse stimuli
- ENCODE consortium collaborator
- work in human and yeast
- Research methods include
- microarrays (Dr. Iyer was co-inventor)
- high-throughput sequencing (since 2007)
- especially ChIP-seq
- also RNA-seq, RIP-seq, MNase-seq ...
- we now have > 1,500 NGS datasets
Communication
Post-its
Green post-it – I'm good at the moment.
Pink post-it – I need a bit of help.
Conventions
Text that you find in courier font refers to a program or file name on a computer.
If you see a block of text like this:
ls -h
it means, "type the command ls -h into a terminal window, hit return, and see what happens".
We intend this course to offer as much self-learning as possible. Consequently, you'll find many collapsed sections; click on the triangle next to one to expand it.
Goals and challenges
Course goals
- Hands-on, tutorial style – learn by doing
- Cover the NGS tool basics – the first few things you'll do after receiving raw sequences
- Get you comfortable with Linux and TACC – your best "frenemies"
- Make you self-sufficient in 4 days, so you can become experts over time
- Show some "best practices" for working with NGS data
Challenges
Large and growing datasets
NGS methods produce staggering amounts of data!
Typical dataset these days
- yeast: 5 – 20 million reads
- human: 20 – 100 million reads
- paired end, length 75 – 100 bases
The initial fastq files are big (100s of MB to GB) – and they're just the start.
- Organization and naming conventions are critical.
- Your data can get out of hand very quickly!
progression of Iyer Lab ChIP-seq datasets over time
- 2008 – Yeast heat shock remodeling of chromatin
- 2 yeast datasets
- less than 2 million reads
- 2010 – Allelic bias in CTCF binding
- 13 CTCF datasets from 3 GM cell lines
- ~200 million reads
- 2012 – Analysis of 3 TFs across 11 cell lines
- 32 datasets gathered over 3 years
- ~ 1 billion reads
- 2014 – QTL analysis of CTCF binding
- 52 very deeply sequenced CTCF datasets
- ~ 8 billion reads
- in progress – Functional analysis of glioblastoma tumors and cell lines
- > 300 datasets so far
- > 17 billion reads
Data wrangling best practices summary
keep fastq files compressed
- Most sequencing facilities will give you compressed sequencing data files
- gzip format (.gz extension) for individual files
- tar or zip format for directories of files
- Even with compression it's easy to run out of storage space!
You may be tempted to un-compress your sequencing files to manipulate them more directly
- resist the temptation to gunzip!
- nearly all modern bioinformatics tools are able to work on .gz files
- there are techniques for working with compressed files without ever un-compressing them
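As a minimal sketch of working on a .gz file without ever writing an un-compressed copy, the commands below stream the data through gunzip -c into ordinary text tools. The file name sample.fastq.gz and its two reads are made up so the example runs on its own; substitute your own file.

```shell
# Build a tiny gzipped FASTQ file so the example is self-contained.
# (The file name and its contents are made up for illustration.)
workdir=$(mktemp -d)
fq="$workdir/sample.fastq.gz"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$fq"

# Peek at the first read (4 lines) without creating an un-compressed file.
gunzip -c "$fq" | head -4

# Count reads: a FASTQ record is 4 lines, so divide the line count by 4.
line_count=$(gunzip -c "$fq" | wc -l)
read_count=$(( line_count / 4 ))
echo "reads: $read_count"

# Search the compressed data directly; count lines that exactly match a sequence.
match_count=$(gunzip -c "$fq" | grep -c '^ACGT$')
echo "matches: $match_count"
```

Because the data only ever flows through a pipe, the disk holds just the compressed file.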
arrange adequate storage space
- Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
- Stage your active projects on corral
- copy data to $WORK or $SCRATCH for analysis
- copy important analysis products back to corral
- Periodically back up corral directories to ranch tape archive
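The staging cycle described above might look roughly like the sketch below. It uses temporary stand-in directories so it can run anywhere; on TACC you would point them at your own corral allocation and at $SCRATCH or $WORK. All directory and file names are hypothetical, and a simple line count stands in for a real analysis tool.

```shell
# Stand-in directories (hypothetical names); replace with your corral
# project directory and a directory under $SCRATCH or $WORK.
base=$(mktemp -d)
corral_proj="$base/corral/my_project"
scratch_dir="$base/scratch/my_project"
mkdir -p "$corral_proj/fastq" "$corral_proj/results" "$scratch_dir"

# A tiny compressed FASTQ file plays the role of the staged raw data.
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$corral_proj/fastq/sample.fastq.gz"

# 1. Copy the (still compressed) input data to the analysis area.
cp -p "$corral_proj/fastq/sample.fastq.gz" "$scratch_dir/"

# 2. Run the analysis on the scratch copy (a line count stands in for a real tool).
gunzip -c "$scratch_dir/sample.fastq.gz" | wc -l > "$scratch_dir/sample.linecount.txt"

# 3. Copy the important analysis products back to the staging area.
cp -p "$scratch_dir/sample.linecount.txt" "$corral_proj/results/"
ls "$corral_proj/results"
```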
back up analysis artifacts regularly
- Obtain an allocation on TACC's ranch tape archive system
- 10 TB is a good initial number
- free! and under-utilized
- Periodically back up your corral directories to ranch tape archive
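One possible way to prepare a periodic backup is to bundle a project directory into a single dated tar file before transferring it, since tape archives generally handle a few large files better than many small ones. The sketch below is an assumption about workflow, not a TACC-prescribed procedure: the project name my_project is made up, and the transfer step is left commented out with placeholder variables (ranch_user, ranch_host) that you would need to fill in from your own allocation details.

```shell
# Stand-in project directory (hypothetical name and contents).
base=$(mktemp -d)
mkdir -p "$base/my_project/results"
echo "peak calls" > "$base/my_project/results/peaks.txt"

# Bundle the whole project directory into one dated, compressed tar file.
archive="$base/my_project.$(date +%Y-%m-%d).tar.gz"
tar -czf "$archive" -C "$base" my_project

# List the archive contents to confirm the bundle before transferring it.
tar -tzf "$archive"

# Transfer step, intentionally commented out. ranch_user and ranch_host are
# placeholders, not real values; set them from your own account details.
# scp "$archive" "$ranch_user@$ranch_host:"
```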
distinguish between types of data
Artifacts from different stages of the analysis will have different archival requirements.
- Original sequence data (fastq files)
- must be backed up!
- Alignments
- usually larger than original fastqs
- should be backed up once stable
- Peak calling artifacts
- Downstream analysis artifacts
While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be deleted after publication.
track your analysis steps
Your analyses should be reproducible by others so you need to keep the equivalent of a lab notebook to document your protocols.
- Keep "work files" that detail analysis steps performed
- here's an Example alignment work file
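If you want a work file to build up as you go, one option is a small shell helper that records each command with a date before running it. This is only an illustrative sketch, not the format of the course's own example work file; the function name log_and_run and the work file location are hypothetical.

```shell
# Hypothetical work file location; in practice, keep it in the project directory.
workfile=$(mktemp)

# Append the date and the command to the work file, then run the command.
log_and_run() {
  printf '%s  %s\n' "$(date +%Y-%m-%d)" "$*" >> "$workfile"
  "$@"
}

# Usage: prefix any analysis command with log_and_run.
log_and_run echo "aligning sample1"

# Review what has been recorded so far.
cat "$workfile"
```

A hand-written text file with dated notes and pasted commands serves the same purpose; the point is that every step is written down.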