Communication

Post-its

Green post-it – I'm good at the moment.

Pink post-it – I need a bit of help.

Conventions

If you see a block of text like this:

ls -h

it means: type the command ls -h into a terminal window, press Return, and see what happens.

We intend this course to offer as much self-learning as possible. Consequently, you'll find many sections like this – click on the triangle to expand them:

Hint sections will provide you some guidance on what to do next, but will not spell it out.

and some sections like this:

Solution sections will contain the commands so that you can copy and paste them if you have to. They will represent one method of answering the question – but there are often many ways to skin a cat!

Your Instructors

Anna Battenhouse, Associate Research Scientist, abattenhouse@utexas.edu
- BA English literature, 1978
- Commercial software development 1982 – 2005
- Joined Iyer Lab 2007 (“retirement career”)
- BS Biochemistry, UT Austin, 2013
- Joined Marcotte Lab and Biomedical Research Computing Facility (BRCF) in summer 2017
Haridha Shivram, haridh@utexas.edu
- 6th year graduate student in the Iyer Lab
- Research Interests: Transcriptional and post-transcriptional regulation of gene expression
- Experienced in analyzing RNA-seq, ChIP-seq, RIP-seq, and CLIP-seq datasets
Claire McWhite, claire.mcwhite@utexas.edu
- 4th year graduate student in the Marcotte Lab

About the Iyer Lab

http://iyerlab.org/
Dr. Vishy Iyer, PI
Main focus is functional genomics
- large-scale transcriptional reprogramming in response to diverse stimuli
- ENCODE consortium collaborator
- work in human and yeast
Research methods include
- microarrays (Dr. Iyer was co-inventor)
- high-throughput sequencing (since 2007)
  - especially ChIP-seq
  - also RNA-seq, RIP-seq, MNase-seq
... we now have nearly 2,000 NGS datasets

Course goals

Hands-on, tutorial style – learn by doing
- common bioinformatics tools & file formats
Introduce NGS vocabulary
- both high-level view and practice with specific tools
Cover the NGS basics
- the first few things you'll do after receiving raw sequences
  - raw sequence preparation
  - alignment to reference
  - basic alignment analysis
Understand and practice required skills
- Get you comfortable with Linux and TACC – your best "frenemies"
- Make you self-sufficient enough in 4 days to become an expert over time
- Show some "best practices" for working with NGS data

NGS Challenges

Diverse skill set requirements

Analysis – making sense of raw data
- one part bioinformatics and statistics
- one part scripting / programming
  - Linux command line
  - High Performance Computing (TACC)
  - bash scripting (grep, awk, sed)
  - R, python, perl
Management – making order out of chaos
- one part organization
- one part data wrangling
Adoption of best practices is critical!

Large and growing datasets

NGS methods produce staggering amounts of data!

Typical dataset these days

yeast: 5 – 20 million reads
human: 20 – 250 million reads
single or paired end, length 75 – 250 bases

The initial fastq files are big (100s of MB to GB) – and they're just the start.
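
To get a feel for these sizes, the arithmetic can be sketched in bash. This is a rough estimate only. It assumes the standard 4-line fastq record (read name, bases, a "+" line, quality scores), and the default read name length of 50 characters is a made-up placeholder, since real read names vary by sequencer. The function name estimate_fastq_bytes is ours for illustration, not a standard tool.

```bash
#!/usr/bin/env bash
# Rough estimate of the uncompressed size of one fastq file.
# Assumption: every record is exactly 4 lines:
#   read name, bases, "+", quality scores.
# Assumption: read names are about 50 characters (hypothetical default).

estimate_fastq_bytes() {
    local num_reads=$1
    local read_len=$2
    local name_len=${3:-50}
    # name + bases + "+" + qualities + 4 newline characters
    local bytes_per_record=$(( name_len + read_len + 1 + read_len + 4 ))
    echo $(( num_reads * bytes_per_record ))
}

# Low end of the yeast range: 5 million reads of 75 bases
estimate_fastq_bytes 5000000 75      # prints 1025000000 (about 1 GB)

# High end of the human range: 250 million reads of 250 bases
estimate_fastq_bytes 250000000 250   # prints 138750000000 (about 139 GB)
```

These are uncompressed sizes for a single file. fastq files are usually stored gzip-compressed, which makes them considerably smaller, and a paired end dataset has two such files.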

Organization and naming conventions are critical.
Your data can get out of hand very quickly!

Progression of Iyer Lab datasets over time:

2008 – Yeast heat shock remodeling of chromatin
- 2 yeast datasets
- less than 2 million sequences
2010 – Allelic bias in CTCF binding
- 13 CTCF datasets from 3 GM cell lines
- ~200 million sequences
2012 – Transcription factor data analysis (ENCODE2)
- 32 ChIP-seq datasets gathered over 3 years (3 TFs across 11 cell lines)
- ~ 1 billion sequences
2013 – miRNA overexpression effects
- 42 RNA-seq datasets (7 conditions)
- ~ 2.6 billion sequences
2014 – eQTL analysis of CTCF binding
- 52 very deeply sequenced CTCF datasets
- ~ 8 billion sequences
2018 – Functional analysis of glioblastoma tumors and cell lines
- nearly 500 datasets in total (ChIP-seq, RNA-seq, miRNA-seq, 4C, exome/genome sequencing)
- > 22 billion sequences