Communication

Post-its

Green post-it – I'm good at the moment.

Pink post-it – I need a bit of help.

Conventions

If you see a block of text like this:

Example code block
ls -h

it means: type the command ls -h into a terminal window, press Enter, and see what happens.

We intend this course to offer as much self-learning as possible. Consequently, you'll find many sections like this (click on the triangle to expand them):

Hint sections will give you some guidance on what to do next, but will not spell it out.

and some sections like this:

Solution sections contain the commands, so you can copy and paste them if you have to. They show one way of answering the question, but there are often many ways to skin a cat!

Your Instructors

  • Anna Battenhouse, Associate Research Scientist, abattenhouse@utexas.edu

    • BA English literature, 1978

    • Commercial software development 1982 – 2005

    • Joined Iyer Lab 2007 (“retirement career”)

    • BS Biochemistry, UT Austin, 2013

    • Joined Marcotte Lab and Biomedical Research Computing Facility (BRCF) in summer 2017
  • Haridha Shivram, haridh@utexas.edu

    • 6th year graduate student in the Iyer Lab

  • Claire McWhite, claire.mcwhite@utexas.edu

About the Iyer Lab

http://iyerlab.org/

Dr. Vishy Iyer, PI

Main focus is functional genomics

    • large-scale transcriptional reprogramming in response to diverse stimuli
    • ENCODE consortium collaborator
    • work in human and yeast


Research methods include
  • microarrays (Dr. Iyer was a co-inventor)

  • high-throughput sequencing (since 2007)
    • especially ChIP-seq
    • also RNA-seq, RIP-seq, MNase-seq ...
    • we now have nearly 2,000 NGS datasets

Course goals

  • Hands-on, tutorial style – learn by doing
    • common bioinformatics tools & file formats
  • Introduce NGS vocabulary
    • both high-level view and practice with specific tools
  • Cover the NGS basics
    • the first few things you'll do after receiving raw sequences
      • raw sequence preparation
      • alignment to reference
      • basic alignment analysis
  • Understand and practice required skills
    • Get you comfortable with Linux and TACC – your best "frenemies"
    • Make you self-sufficient enough in 4 days that you can become experts over time
    • Show some "best practices" for working with NGS data

NGS Challenges

Diverse skill set requirements

  • Analysis – making sense of raw data
    • one part bioinformatics and statistics
    • one part scripting / programming
      • Linux command line
      • High Performance Computing (TACC)
      • bash scripting (grep, awk, sed)
      • R, python, perl
  • Management – making order out of chaos
    • one part organization
    • one part data wrangling
  • Adoption of best practices is critical!

Large and growing datasets

NGS methods produce staggering amounts of data!

Typical dataset sizes these days:

  • yeast:  5 – 20 million reads
  • human:  20 – 250 million reads
  • single or paired end, length 75 – 250 bases

The initial FASTQ files are big (100s of MB to GB), and they're just the start.

  • Organization and naming conventions are critical.
  • Your data can get out of hand very quickly!
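
To get a feel for where those file sizes come from, here is a rough back-of-the-envelope sketch in Python. It assumes the standard four-line FASTQ record (header, sequence, "+", qualities) and a header of about 50 characters; the function names and the 50-character header are illustrative assumptions, not course material. It estimates uncompressed size only.

```python
def fastq_bytes(n_reads, read_len, header_len=50, paired=False):
    """Rough uncompressed FASTQ size in bytes.

    n_reads    -- number of sequenced fragments (one record per file)
    read_len   -- bases per read
    header_len -- assumed characters in the '@...' header line (a guess)
    paired     -- paired-end data has two files, R1 and R2
    """
    # Four lines per record, each ending in a newline:
    # header, sequence, '+', and one quality character per base.
    per_record = (header_len + 1) + (read_len + 1) + 2 + (read_len + 1)
    n_files = 2 if paired else 1
    return n_reads * per_record * n_files


def human(n_bytes):
    """Format a byte count using binary units (1 KB = 1024 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1024 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


# Low and high ends of the "typical dataset" ranges listed above.
examples = [
    ("yeast, 5 million x 75 bases", 5_000_000, 75),
    ("human, 250 million x 250 bases", 250_000_000, 250),
]
for label, n_reads, read_len in examples:
    print(f"{label}: about {human(fastq_bytes(n_reads, read_len))}")
```

FASTQ files are usually stored gzip-compressed, which makes them considerably smaller than this estimate; the exact ratio depends on the data. Either way, the size scales with reads times read length, and doubles again for paired-end data.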

Progression of Iyer Lab datasets over time:

  • 2008 – Yeast heat shock remodeling of chromatin
    • 2 yeast datasets
    • less than 2 million sequences
  • 2010 – Allelic bias in CTCF binding
    • 13 CTCF datasets from 3 GM cell lines
    • ~200 million sequences
  • 2012 – Transcription factor data analysis (ENCODE2)
    • 32 ChIP-seq datasets gathered over 3 years (3 TFs across 11 cell lines)
    • ~1 billion sequences
  • 2013 – miRNA overexpression effects
    • 42 RNA-seq datasets (7 conditions)
    • ~2.6 billion sequences
  • 2014 – eQTL analysis of CTCF binding
    • 52 very deeply sequenced CTCF datasets
    • ~8 billion sequences
  • 2018 – Functional analysis of glioblastoma tumors and cell lines
    • nearly 500 datasets in total (ChIP-seq, RNA-seq, miRNA-seq, 4C, exome/genome sequencing)
    • > 22 billion sequences

