...
NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB) – and they're just the start.
...
- Most sequencing facilities will give you compressed sequencing data files
- gzip format (.gz extension) for individual files
- tar or zip format for directories of files
- Even with compression it's easy to run out of storage space!
You may be tempted to decompress your sequencing files to manipulate them more directly
- resist the temptation to gunzip!
- nearly all modern bioinformatics tools are able to work on .gz files
- there are techniques for working with compressed files without ever decompressing them
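As one illustration of working on a .gz file directly, here is a minimal Python sketch that counts reads and peeks at the first records of a compressed FASTQ file without ever writing a decompressed copy to disk. The function names `count_fastq_reads` and `head_fastq` are hypothetical, and the sketch assumes the common 4-lines-per-record FASTQ layout.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file without writing a decompressed copy."""
    line_count = 0
    # gzip.open in text mode ("rt") decompresses on the fly as lines are read
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            line_count += 1
    # Assumption: standard FASTQ layout with 4 lines per record
    return line_count // 4


def head_fastq(path, n_reads=1):
    """Return up to n_reads records from the start of the file, each as a list of 4 lines."""
    records = []
    with gzip.open(path, "rt") as handle:
        while len(records) < n_reads:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                # An empty header line means we reached the end of the file
                break
            records.append(record)
    return records
```

Only the data being read is decompressed in memory, so the file on disk stays compressed the whole time.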
arrange adequate storage space
- At TACC
- Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
- Stage your active projects on corral or $WORK
- copy data to $SCRATCH for analysis
- copy important analysis products back to corral or $WORK
- Periodically back up corral or $WORK directories to ranch tape archive
- On a UT Biomedical Research Computing Facility (BRCF) "POD"
- See https://wikis.utexas.edu/display/RCTFusers
- Home and Work areas on POD servers are automatically backed up weekly
- and archived to ranch every 4-6 months
- GSAF customers can obtain a no-cost 2 TB allocation on the shared GSAF POD
back up analysis artifacts regularly
- All TACC users automatically have a 2 TB allocation on TACC's ranch tape archive system
- larger allocations can be requested by project owners in the TACC User Portal
- free! and under-utilized
- Periodically back up your corral or $WORK directories to ranch tape archive
- large directories should be combined first using the tar program
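The course itself points to the tar program for combining a directory before archiving. As a rough equivalent, here is a minimal Python sketch of the same step using the standard `tarfile` module. The function name `archive_directory` and any paths passed to it are hypothetical. The archive is written uncompressed, on the assumption that the sequencing files inside are already gzipped.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, tar_path):
    """Bundle src_dir into a single tar file and return the stored member names."""
    src = Path(src_dir)
    # Mode "w" writes a plain (uncompressed) tar; the assumption is that the
    # contents are mostly .gz files that would not shrink further.
    with tarfile.open(tar_path, "w") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(str(src), arcname=src.name)
    # Re-open the archive and list its members as a quick sanity check
    with tarfile.open(tar_path, "r") as tar:
        return tar.getnames()
```

Listing the members after writing gives a simple confirmation that everything made it into the archive before it is copied to tape.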
distinguish between types of data
Artifacts from different stages of the analysis will have different archival requirements.
- Original sequence data (FASTQ files)
- must be backed up!
- Alignments
- usually larger than the original FASTQs
- can be backed up once stable
- Downstream analysis artifacts
- Reporting artifacts (plots, plotting code)
While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be removed after publication.
track your analysis steps
...
- Keep "work files" that detail analysis steps performed
- here's an example alignment work file
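One possible way to keep a work file up to date is to record each command at the moment it is run. The sketch below is only an illustration of that idea, not the format used by the course's own work files. The helper name `run_and_record` and the default file name `analysis.work.txt` are hypothetical.

```python
import shlex
import subprocess
from datetime import datetime


def run_and_record(command, work_file="analysis.work.txt"):
    """Append a command (given as a list of arguments) to a work file, then run it."""
    # Record a timestamp and the exact command line before running it
    with open(work_file, "a") as log:
        log.write(f"# {datetime.now().isoformat(timespec='seconds')}\n")
        log.write(shlex.join(command) + "\n")
    # check=True raises CalledProcessError if the command exits non-zero
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Because the command is written to the file before it runs, even a step that fails still leaves a trace in the work file.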
...