
...

Manuscript files – You'll want a place to store drafts of the manuscript. This is also where I store the Project Master Document described in Documentation. I store backups of my notes here, but the active version is in GoodNotes on my iPad. This directory is also in a space that collaborators can reach. At some point I might start storing the drafts of the manuscript and other notes in my GitHub repository, which would mean that I would NOT store these files in a directory that collaborators can access. Instead, collaborators would pull these files from the repository, but that works only if my manuscripts and notes are text files rather than Word documents.

Scripts for processing and analyzing data – These are archived using GitHub. Note that each project has a setup script that calls a personal setup script for each collaborator. The personal setup script sets macros for the location of each of the directories on the project. There are some examples of this in the tech tips (but it needs some development).
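The setup-script idea can be sketched in a few lines. This is a minimal illustration in Python rather than Stata, and it is not the code from the tech tips: the user names, paths, and the `project_setup` function are all hypothetical, and in practice each collaborator would keep their own locations in a personal setup file instead of one shared dictionary.

```python
from pathlib import Path

# Directories every collaborator must define (names are assumptions that
# mirror the categories used on this page).
REQUIRED_DIRS = {"original_data", "created_data", "results", "logs"}

# Hypothetical personal setups: one entry per collaborator. In a real
# project each person would keep these in their own personal setup script.
PERSONAL_SETUPS = {
    "alice": {
        "original_data": "/Users/alice/Box/project/original_data",
        "created_data": "/Users/alice/project/created_data",
        "results": "/Users/alice/project/results",
        "logs": "/Users/alice/project/logs",
    },
    "bob": {
        "original_data": "/home/bob/Box/project/original_data",
        "created_data": "/home/bob/work/project/created_data",
        "results": "/home/bob/work/project/results",
        "logs": "/home/bob/work/project/logs",
    },
}


def project_setup(user, personal_setups=PERSONAL_SETUPS):
    """Return the project directory locations for one collaborator."""
    if user not in personal_setups:
        raise KeyError(f"no personal setup found for {user!r}")
    locations = personal_setups[user]
    # Fail early if a collaborator forgot to define one of the directories.
    missing = REQUIRED_DIRS - set(locations)
    if missing:
        raise ValueError(f"personal setup for {user!r} is missing: {sorted(missing)}")
    return {name: Path(locations[name]) for name in REQUIRED_DIRS}
```

An analysis script would then call something like `project_setup("alice")` once at the top and refer to `dirs["results"]` afterwards, so that no script hard-codes a path that exists on only one person's machine.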

...

I also recommend that you keep your scripts in a directory that is linked to a code repository that is backed up to the cloud, like GitHub. I discuss the benefits of using a code versioning system elsewhere. Having your scripts in a repository has important implications related to file structure. First, you will want to avoid putting your data in the code repository, unless your data file is very small and will not change. To remove the temptation, you could either add the data to the .gitignore file or keep your data separate from the git repository. Second, keep the products of your code out of the repository as well. One reason why you want to keep the original data and analytical products out of the repository has to do with how versioning systems store your changes. In short, git works best with text files: it can show what changed between versions line by line and store the history compactly. If the files are binary (e.g. a Stata data file, Excel file, Word document, image), git cannot make a useful comparison and effectively stores a whole new copy each time the file changes. Over time this will make your repository unwieldy. Another reason is that the data products can get out of sync with the code if some members of the research team are committing only the scripts, which I’m arguing is the better practice. Trust the process. If you need to go back to an earlier research product, you can check out the version of the repository that produced that product and re-run the code to reproduce it. If your code cannot produce the data product, then the product isn’t documented and the system isn’t working as designed.

Third, use subdirectories within your main project script directory to organize your code (for example, this project has subdirectories for the R and Stata versions of the file organization scripts). When you're moving files across directories, use git mv rather than your operating system to move the files. That way git records the move, the history of the file stays easy to follow, and git won't reinstall the old version of the file in its old place when you sync your repository to the cloud.
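One way to check that data and binary products are staying out of the script directory is to scan it for binary file types before committing. The sketch below is only an illustration: the extension list and both function names are assumptions, and the list should be adjusted to whatever your project actually produces.

```python
from pathlib import Path

# Hypothetical list of binary file types that should not live in the
# script repository (Stata data, Excel, Word, images, PDFs).
BINARY_EXTENSIONS = {".dta", ".xls", ".xlsx", ".doc", ".docx", ".png", ".jpg", ".pdf"}


def find_binary_files(paths):
    """Return the paths whose extension marks them as binary files."""
    return [p for p in paths if Path(p).suffix.lower() in BINARY_EXTENSIONS]


def scan_directory(root):
    """List binary files under root, skipping git's own .git directory."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file() and ".git" not in p.parts]
    return find_binary_files(files)
```

Running `scan_directory` on the script directory should return an empty list if the repository holds only code. Anything it does return is a candidate for the .gitignore file or for one of the data and results directories.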

An example of a setup for reproducible results is here: https://github.com/kraley/workflow (currently only for Stata, but I'm working on a version for R).

Naming files

Some people advocate for including a descriptive prefix for all files related to a paper or project. For example, I might have a project investigating trends in divorce and so all files related to that paper should start with "div." I once used this approach but over time I've decided that it is a waste of characters in the name of the file. The directory name tells me what project the file is for. Instead, I give files a name that describes the role of the file on the project, like PAA_ExtAbstract. This extends to the scripts that analyze the data. These are also stored in a project (or paper) specific directory and are named according to function rather than the paper topic. If you want to share a file, you can point someone to your GitHub repository for the project so that they have the file in context.

...

On another page I discuss the different kinds of cloud storage, but this relates to your file structure as well. With the File Structure described above, I keep the Original Data and manuscript files in Box or on a network drive that my collaborators can access. I keep scripts for processing and analyzing data in the GitHub repository. Created Data, Results, and Log Files are all stored in my own personal directories and not shared with my collaborators, who are expected to independently replicate results.
