Data wrangling best practices

NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB) – and they're just the start.

Organization and good practices are critical! Your data can get out of hand very quickly!

keep FASTQ files compressed

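You may be tempted to decompress your sequencing files to manipulate them more directly. Resist the urge: many bioinformatics tools can read gzip-compressed FASTQ files as they are, and decompressed copies take up considerably more disk space.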
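As an illustration of working on a compressed file in place, here is a minimal Python sketch that counts the reads in a gzipped FASTQ file without writing a decompressed copy to disk. The function name count_fastq_reads is hypothetical, and the sketch assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines).

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file.

    Assumes each record spans exactly four lines (header, sequence,
    '+' separator, qualities). The file is streamed line by line, so
    no decompressed copy is ever written to disk.
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Calling count_fastq_reads("sample.fastq.gz") on a file of your own (the file name here is only a placeholder) returns the number of reads while the data stays compressed.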

arrange adequate storage space

back up analysis artifacts regularly

distinguish between types of data

Artifacts from different stages of the analysis will have different archival requirements.

While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be removed after publication.

track your analysis steps

Your analyses should be reproducible by others, so keep the equivalent of a lab notebook to document your protocols.
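One lightweight way to keep such a notebook is to record every command as it runs. The sketch below is an assumption about how this could look, not a prescribed tool: the helper run_and_log and the default log file name analysis_log.txt are hypothetical. It runs a command and appends a timestamped line with the exact command and its exit status to a plain-text log.

```python
import datetime
import shlex
import subprocess


def run_and_log(cmd, log_path="analysis_log.txt"):
    """Run cmd (a list of arguments) and append a record to log_path.

    Each log line holds the start time, the command as it could be
    typed in a shell, and the exit status, separated by tabs.
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "a") as log:
        log.write(f"{started}\t{shlex.join(cmd)}\texit={result.returncode}\n")
    return result
```

Because the log is append-only plain text, it can be backed up and archived alongside the analysis artifacts it describes.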