Your Instructors
Anna Battenhouse, Associate Research Scientist, abattenhouse@utexas.edu
BA English literature, 1978
Commercial software development 1982 – 2005
Joined Iyer Lab 2007 (“retirement career”)
BS Biochemistry, UT Austin, 2013
Joined Marcotte Lab and Biomedical Research Computing Facility (BRCF) in summer 2017
Haridha Shivram, haridh@utexas.edu
6th year graduate student in the Iyer Lab
Claire McWhite, claire.mcwhite@utexas.edu
5th year graduate student in the Marcotte Lab
About the Iyer Lab
Dr. Vishy Iyer, PI
Main focus is functional genomics
Research methods include
Communication
Post-its
Green post-it – I'm good at the moment.
Pink post-it – I need a bit of help.
Conventions
If you see a block of text like this:
ls -h
it means: type the command ls -h into a terminal window, hit return, and see what happens.
We intend this course to offer as much self-learning as possible. Consequently, you'll find many sections like this - click on the triangle to expand them:
and some sections like this:
Course goals
- Hands-on, tutorial style – learn by doing
- common bioinformatics tools & file formats
- Introduce NGS vocabulary
- both high-level view and practice with specific tools
- Cover the NGS basics
- the first few things you'll do after receiving raw sequences
- raw sequence preparation
- alignment to reference
- basic alignment analysis
- Understand and practice required skills
- Get you comfortable with Linux and TACC – your best "frenemies"
- Make you self-sufficient enough in 4 days that you can become an expert over time
- Show some "best practices" for working with NGS data
NGS Challenges
Diverse skill set requirements
Large and growing datasets
NGS methods produce staggering amounts of data!
Typical dataset these days
- yeast: 5 – 20 million reads
- human: 20 – 250 million reads
- single or paired end, length 75 – 250 bases
The initial fastq files are big (100s of MB to GB) – and they're just the start.
- Organization and naming conventions are critical.
- Your data can get out of hand very quickly!
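To get a feel for why fastq files get so big, here is a rough back-of-the-envelope sketch. It assumes each read is stored as a standard 4-line fastq record (header, sequence, a "+" line, quality string) and that headers average about 50 characters. That header length is a guess for illustration, and estimate_fastq_bytes is a made-up helper name, not a real tool.

```shell
# Rough estimate of UNCOMPRESSED fastq size, in bytes.
# Each read is a 4-line record: header, sequence, "+", quality string.
# header_len is an ASSUMED average header length (hypothetical default: 50).
estimate_fastq_bytes() {
    reads=$1
    read_len=$2
    header_len=${3:-50}
    # Bytes per record, counting one newline per line:
    #   header line + sequence line + "+" line + quality line
    per_record=$(( (header_len + 1) + (read_len + 1) + 2 + (read_len + 1) ))
    echo $(( reads * per_record ))
}

# 20 million reads of 100 bases each (within the ranges quoted above)
estimate_fastq_bytes 20000000 100
```

Under these assumptions the estimate comes to about 5.1 billion bytes (roughly 5 GB) before compression. Compressed fastq files are usually several times smaller, but the exact ratio depends on the data.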
Progression of Iyer Lab datasets over time:
- 2008 – Yeast heat shock remodeling of chromatin
- 2 yeast datasets
- less than 2 million sequences
- 2010 – Allelic bias in CTCF binding
- 13 CTCF datasets from 3 GM cell lines
- ~200 million sequences
- 2012 – Transcription factor data analysis (ENCODE2)
- 32 ChIP-seq datasets gathered over 3 years (3 TFs across 11 cell lines)
- ~ 1 billion sequences
- 2013 – miRNA overexpression effects
- 42 RNAseq datasets (7 conditions)
- ~ 2.6 billion sequences
- 2014 – eQTL analysis of CTCF binding
- 52 very deeply sequenced CTCF datasets
- ~ 8 billion sequences
- 2018 – Functional analysis of glioblastoma tumors and cell lines
- nearly 500 datasets in total (ChIP-seq, RNAseq, miRNAseq, 4C, exome/genome sequencing)
- > 22 billion sequences