You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 5 Next »

Your computer has a directory structure with a base directory and sub-directories. You might have a sub-directory for each project you are working on and, within that directory a sub-directory for each paper within that project. For your research you will need to store many different kinds of files including documentation and notes, original data, created data, scripts for creating variables, scripts for data analysis, log files, and results files. The specific organization might be different for each project and even for each collaborator on a project. Nonetheless, you should have a plan for organizing and documenting the location of each element. One approach might be to create a sub-directory for each kind of document. Further, naming directories something usefully descriptive can make it easier for you to find the files that you want. The specific organization will likely be shaped by how you are using cloud storage (see below). 

The set of directories that you will likely need for each paper/project include:

Original Data – ideally everyone on the project has access to the exact same data. The best way to make sure this is true is to be working from the same copy of the data and this means that you will use cloud storage or a network drive to to house the data. Definitely write-protect this directory. It is also a good idea to include a document in this directory describing where the data came from and when they were downloaded. An alternative might be to keep the original data in your GitHub repository. I don't do that because the data I use often includes large data files and I don't want a bloated repository. 

Text files – You'll want a place where you store drafts of the manuscript. This is also where I store the Project Master Document described in Documentation. Store backups of note here but the active version is on goodnotes on my iPad.  This directory is also in a space that collaborators can reach. Some time in the future I might start to store the drafts of the manuscript and other notes in by GitHub Repository and that will mean that I would NOT store these files in a directory that can be accessed by collaborators. Instead, collaborators would pull these files from the repository. 

Scripts for processing and analyzing data – These are archived using GitHub. Note that each project has a setup script that calls a personal setup script for each collaborator. The personal setup script sets macros for the location of each of the directories on the project. There are some examples of this in the tech tips (but it needs some development). 

Created Data – these are data files that I will use in the analysis. Often they take a long time to create and I don't want to have to rerun all of the code that creates the data every time I do an analysis. So, I have a directory for created data files. I do not share this directory with any project members. Each one of us has our own copy of the created data that we create using the scripts that are housed in our repository. This makes it so that each project member is replicating the results independently. To make sure we stay in sync, I will typically pull all changes from the repository when I start work for the day and rerun the code to create the analysis data files. Then I might comment out the portion of my main do file that runs the code to create the data for the rest of the day to save time. 

Results – this is where the automated analysis scripts put the results, including output from markdown files. I do not share this directory with collaborators so that each person on the project replicates the results independently.  This practice aligns with suggestions in Christensen, Freese, and Migel's Transparent and Reproducible Social Science Research to not save statistical output but instead save the data and code that produces that output. Of course, our current systems require that you have at least one copy of your results to submit for peer review. 

Log files – I put this in a directory separate from results just to keep my results directory relatively clean (but it is still messy). Log files might have dates in their names, but that will mean storing a lot of log files. This directory is not shared and it's not in the repository. 

Naming files

Some people advocate for including a descriptive prefix for all files related to a paper or project. For example, I might have a project investigating trends in divorce and so all files related to that paper should start with "div." I once used this approach but over time I've decided that it is a waste of characters in the name of the file.  The directory name tells me what project the file is for. Instead, I give files a name that describes the role of the file on the project, like PAA_ExtAbstract. This extends to the scripts that analyze the data. These are also stored in a project (or paper) specific directory and are named according to function rather than the paper topic. If you want to share a file you can point someone to your GitHub repository for the project and so they have the file in context. 

You might develop conventions about what to call each piece of the documentation for your project. Such as

Project Master Document - "master.docx" or "master.txt".

Notebook - "RunningNotes.txt"

Code Master Document – README.txt of README.md. GitHub will kindly create one of these when you initiate the repository.

For a current project we are naming our analysis files for the part of the paper that they produce. Doing that hasn't worked great because we have many files named Table 2XXX and only one actually produces the real Table 2. It's true that the XXX part of the name gives us some clues as to which file is likely to produce the actual Table 2, but it has been a source of mild confusion on the project. So, I don't (yet) recommend this approach. 

Cloud Storage

On another page I discuss the different kinds of cloud storage, but this relates to your file structure as well. With the File Structure described above, I keep the Original Data and text files in Box or on a network drive that my collaborators can access. I keep scripts for processing and analyzing data in the GitHub repository. Created Data, Results, and Log Files are all stored in my own personal directories and not shared with my collaborators who are expected to independently replicate results. 

  • No labels