UCSC genome browser, SRA data downloads

UCSC Genome Browser intro

The UCSC Genome Browser is an invaluable resource both for obtaining public sequencing data and for visualizing it.

Demo - Data resources

http://genome.ucsc.edu/ → Genome Browser → submit
navigation
- type GAPDH in gene box → jump
- note zoom out/zoom in buttons; click on position or click/drag
track detail
- click "Multiz Alignments" to expand track detail
- click on one of the SNPs to expand track detail
  - then click on the SNP name to see details
selecting tracks
- under "Regulation" section, change Regulation track from "show" to "hide" → refresh
- right click "Multiz Alignments" → hide
- under "Phenotype and Disease Association" change GWAS Catalog from "hide" to "squish" → refresh
type PRNP in gene box → jump
- click on "NHGRI Catalog..." track description to expand detail
- note correspondence between SNPs (SNP 132) and disease SNPs (GWAS)
- click on one of the disease SNPs for detail

Exercise 1

Using the UCSC Genome Browser, determine whether Craig Venter or James Watson has a higher risk of Alzheimer's disease.

Hints, Solution

Demo - Downloading annotation data

For RNA-seq you often need a GTF file, but how do you find one? One way is to download annotations from the UCSC Table Browser in GTF format:

http://genome.ucsc.edu/cgi-bin/hgTables
- clade: Mammal, genome: Human, assembly: hg19
- group: Genes and Gene Prediction tracks, track: RefSeq genes
- output format: GTF - gene transfer format
- optional: enter a filename in the output file box
- → get output
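
Once the file is downloaded, it is worth a quick look before feeding it to an RNA-seq tool. The sketch below is one way to do that, assuming the download is a standard tab-separated, nine-column GTF. The function names are made up for this example, and refseq_hg19.gtf stands in for whatever filename you entered in the Table Browser.

```shell
# Hypothetical helpers for inspecting a downloaded GTF file.
# Assumes standard GTF: tab-separated, 9 columns, feature type in column 3,
# attributes (gene_id, transcript_id) in column 9.

# Print the number of records for each feature type (exon, CDS, ...).
gtf_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (f in n) print f "\t" n[f] }' "$1" | LC_ALL=C sort
}

# Print the number of distinct transcript_id values in the file.
gtf_transcript_count() {
    grep -o 'transcript_id "[^"]*"' "$1" | LC_ALL=C sort -u | awk 'END { print NR }'
}

# Example usage (filename is a placeholder):
#   gtf_feature_counts refseq_hg19.gtf
#   gtf_transcript_count refseq_hg19.gtf
```

If the feature counts come back empty, the file is probably not tab-separated GTF (for example, an HTML error page saved under a .gtf name).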

Exercise 2

Using the UCSC Genome Browser, find and download a list of high-sequencing-depth regions in BED format.

Hints, Solution

SRA Toolkit overview

SRA (Sequence Read Archive) is an NCBI-defined interchange format for NGS data. The idea is that before submitting your data to NCBI, you convert whatever format it is in (fastq, bam, etc.) to SRA format using one of the "load" tools. Then, the data can be downloaded from NCBI by anyone and extracted in one of a number of different formats as desired (ABI csfasta/qual, fastq).

While this sounds like a great idea (someone else taking care of format interchange issues for you!), the toolkit is no longer being actively developed except for bug fixes. However, there is a lot of interesting data out there that's only available as SRAs, so it is worthwhile knowing how to use the toolkit.

The SRA Toolkit documentation, such as it is, is located at the NCBI website.

SRA Example

You have aligned a ChIP-seq dataset to hg19 and have a .bam file. You want to upload the data to NCBI. You use the bam-load tool:

bam-load -o mySRA.sra myAlignment.bam

The raw reads can then be extracted to fastq using fastq-dump:

fastq-dump mySRA.sra

This looks deceptively simple, but you can run into problems. For one thing, SRA Toolkit versions change often and are not always compatible with each other. So if you get any weird errors, check for a newer (or sometimes older) toolkit version.
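
Because extraction can fail in ways that are easy to miss, a quick sanity check on the resulting fastq is cheap insurance. The sketch below assumes plain four-line fastq records (no wrapped sequence lines) and an uncompressed file. The function name is made up for this example, and mySRA.fastq is the assumed output name from the fastq-dump command above.

```shell
# Hypothetical helper: count reads in a fastq file and flag truncation.
# Assumes each record is exactly 4 lines (header, sequence, '+', qualities).
fastq_read_count() {
    awk 'END {
        if (NR % 4 != 0) {
            # A line count that is not a multiple of 4 suggests a truncated file
            print "line count " NR " is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        print NR / 4
    }' "$1"
}

# Example usage (filename is an assumption about fastq-dump's output):
#   fastq_read_count mySRA.fastq
```

A non-zero exit status here is a hint to re-run the extraction, possibly with a different toolkit version.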
