...

NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB) – and they're just the start.

...

  • Most sequencing facilities will give you compressed sequencing data files
    • gzip format (.gz extension) for individual files
    • tar or zip format for directories of files
  • Even with compression it's easy to run out of storage space!

You may be tempted to decompress your sequencing files to manipulate them more directly

  • resist the temptation to gunzip!
  • nearly all modern bioinformatics tools are able to work on .gz files
  • there are techniques for working with compressed files without ever decompressing them
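
As one illustration of working on a compressed file directly, here is a minimal Python sketch that counts the reads in a gzipped FASTQ file by streaming it, so no decompressed copy is ever written to disk. The function name `count_fastq_reads` and the file name in the usage line are hypothetical, and the sketch assumes the common 4-lines-per-record FASTQ layout (no wrapped sequence lines).

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file by streaming it.

    Assumes every record occupies exactly 4 lines.
    """
    line_count = 0
    # "rt" opens the compressed stream in text mode; lines are
    # decompressed in memory as they are read, never written to disk.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            line_count += 1
    return line_count // 4
```

Calling `count_fastq_reads("sample_R1.fastq.gz")` on a file of your own returns the number of reads while the .gz file on disk stays untouched.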

arrange adequate storage space

  • At TACC
    • Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
    • Stage your active projects on corral or $WORK
      • copy data to $SCRATCH for analysis
      • copy important analysis products back to corral or $WORK
    • Periodically back up corral or $WORK directories to ranch tape archive
  • On a UT Biomedical Research Computing Facility (BRCF) "POD"
    • See https://wikis.utexas.edu/display/RCTFusers
      • Home and Work areas on POD servers are automatically backed up weekly
        • and archived to ranch every 4-6 months
    • GSAF customers can obtain a no-cost 2 TB allocation on the shared GSAF POD

back up analysis artifacts regularly

  • All TACC users automatically have a 2 TB allocation on TACC's ranch tape archive system
    • larger allocations can be requested by project owners in the TACC User Portal
    • free! and under-utilized
  • Periodically back up your corral or $WORK directories to ranch tape archive
    • large directories should be combined first using the tar program
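
The tar step can be sketched in Python with the standard `tarfile` module. This is only an illustration: the helper name `archive_directory` and the paths in the usage line are hypothetical, and the transfer of the resulting file to ranch is not shown. Extra gzip compression is off by default on the assumption that the directory mostly holds files that are already compressed.

```python
import tarfile
from pathlib import Path


def archive_directory(src_dir, dest_tar, compress=False):
    """Bundle a directory into a single tar file and return its member names."""
    src = Path(src_dir)
    # "w" writes a plain tar; "w:gz" adds gzip compression on top.
    write_mode = "w:gz" if compress else "w"
    with tarfile.open(dest_tar, write_mode) as tar:
        # arcname stores paths relative to the directory's own name
        # rather than its full absolute path.
        tar.add(src, arcname=src.name)
    # Re-open the archive and list its contents as a basic sanity check.
    with tarfile.open(dest_tar, "r:*") as tar:
        return tar.getnames()
```

For example, `archive_directory("my_project", "my_project.tar")` produces one file to copy to the archive instead of many small ones. Write the tar file outside the directory being archived.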

distinguish between types of data

Artifacts from different stages of the analysis will have different archival requirements.

  • Original sequence data (FASTQ files)
    • must be backed up!
  • Alignments
    • usually larger than the original FASTQs
    • can be backed up once stable
  • Downstream analysis artifacts
  • Reporting artifacts (plots, plotting code)

While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be removed after publication.

track your analysis steps

...

...