Organising Shareable R Projects
Some advice for organising code for projects in linguistics using RStudio.
A complex research project can easily get out of hand organisationally. Sometimes we end up with multiple folders with different data sets, models, plots, scripts, and R Markdown files. This can become deeply confusing as we work. We want our projects to be well structured.
We also want our work to be shareable. Often newcomers to R produce files
which depend on their own directory structure. A telltale sign is a line of code like:
my_data <- read_rds('/home/jane_doe/Documents/my_project/data/third_attempt/data_2.rds').
This will cause problems for anyone trying to run the code on a different
computer.
The demand for well structured and shareable projects is doubly true if we want to make our data and analysis public by means of a GitHub or GitLab repository.
Having read a few blog posts and textbook sections, I’ve settled on the following setup for my projects at NZILBB. This post assumes you are using RStudio.
The structure I recommend looks like this:
    project_name
    |   README.md
    |   project_name.Rproj
    |
    └─── data
    |   |   raw_data.csv
    |   |   processed_data.rds
    |   |   ...
    |
    └─── supplementaries
    |   |   preprocessing.Rmd
    |   |   modelling.Rmd
    |   |   ...
    |
    └─── presentations
    |   └─── conference_name
    |       |   presentation.Rmd
    |       |   ...
    |
    └─── write_up
    |   |   write_up.Rmd
    |   |   ...
    |
    └─── scripts
    |   |   anonymisation.R
    |   |   ...
    |
    └─── models
    |   |   model_name.rds
    |   |   ...
    |
    └─── plots
        |   data_distribution.png
        |   model_predictions.png
        |   ...
The names of the files inside the folders above are illustrative examples.
In a file explorer, it’ll look like this:
The idea is to have a folder for each project. Within that folder, you will
always have a
data folder within which you should store all of your data
files. Sometimes I distinguish between a raw data folder and one with data
which has been processed as part of the project. This is not necessary though.
The majority of the R code which you will create will be in the
supplementaries and scripts folders. I like to keep these separate, rather than
having a single scripts folder for all .R and .Rmd files. This is because I tend
to end up with lots of small scripts which respond to small tasks which occur in
the course of a project or which are prototypes of an analysis which will
eventually appear in an R markdown document. The scripts folder can hold this
potentially messy material. The supplementaries folder, on the other hand, will
hold the main R markdown files containing the analysis code and any descriptions
needed for someone to make sense of the steps you have carried out.
The R markdown files in the
supplementaries folder will likely include much
more than you report in any papers you produce from the project. In order to
avoid unwieldy R markdown files which take a long time to knit, I like to have
a distinct file for each major step in the project.
We often save plots in the course of our analysis. It is useful to have these
in a plots folder. These can then be
independently shared or used with other programmes.
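As a minimal sketch of saving a plot into a plots folder, the following uses only base R graphics. The temporary project folder and the simulated data are made up for illustration; in a real project you would point at your own plots folder and your own data.

```r
# Hypothetical project folder inside a temporary directory (illustration only).
project_dir <- file.path(tempdir(), "project_name")
plots_dir <- file.path(project_dir, "plots")
dir.create(plots_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up data: 100 simulated durations in milliseconds.
set.seed(1)
durations <- rnorm(100, mean = 120, sd = 25)

# Open a PNG device, draw the plot, then close the device to write the file.
plot_path <- file.path(plots_dir, "data_distribution.png")
png(plot_path, width = 800, height = 600)
hist(durations, main = "Distribution of durations", xlab = "Duration (ms)")
dev.off()

# Check that the image file now exists in the plots folder.
file.exists(plot_path)
```

Once the file is on disk it can be dropped into slides or a manuscript without rerunning the analysis.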
In projects which handle large data sets, it is often wise to save any models we
have fit to the data. These models can sometimes take hours to fit, so saving
them avoids wasting time refitting. We put these files in the models folder.
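One way to avoid refitting is to check whether a saved model already exists before fitting it. The sketch below assumes a throwaway project folder in a temporary directory and uses a quick linear model on the built-in cars data set as a stand-in for a slow model; the file name model_name.rds simply follows the illustrative structure above.

```r
# Hypothetical project folder inside a temporary directory (illustration only).
project_dir <- file.path(tempdir(), "project_name")
models_dir <- file.path(project_dir, "models")
dir.create(models_dir, recursive = TRUE, showWarnings = FALSE)

model_path <- file.path(models_dir, "model_name.rds")

if (file.exists(model_path)) {
  # Reuse the saved model rather than refitting it.
  model <- readRDS(model_path)
} else {
  # Stand-in for a slow model: a quick linear model on a built-in data set.
  model <- lm(dist ~ speed, data = cars)
  saveRDS(model, model_path)
}

# Either way, `model` is now available for further work.
coef(model)
```

If the data or the model formula changes, the saved file has to be deleted (or the check removed) so that the model is fitted again.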
The remaining folders in the structure above are optional. I often create
presentations using the
ioslides packages. I store these in a
presentations directory. Sometimes I include a
write_up folder for any
journal articles or other written outputs which might come from the project.
Within the project folder, you should have a file called
README.md. This is
a markdown file which should explain what your project is, where things
are stored within the project, and how the project can be interacted with.
If you use GitHub or GitLab, the content of this file will be the first thing
that any potential user will see of your project.
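The folders described above can be created by hand, but a few lines of base R will also do it. This is only a sketch: it builds the structure inside a temporary directory so that nothing is written to your real files, and the project name is the placeholder used in the structure above.

```r
# Hypothetical location: a temporary directory, used here for illustration only.
project_dir <- file.path(tempdir(), "project_name")

# Folder names taken from the recommended structure.
folders <- c("data", "supplementaries", "presentations", "write_up",
             "scripts", "models", "plots")

for (folder in folders) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Start an empty README.md, but never truncate one that already exists.
readme_path <- file.path(project_dir, "README.md")
if (!file.exists(readme_path)) {
  file.create(readme_path)
}

# List what was created at the top level of the project.
list.files(project_dir)
```

The sketch deliberately leaves out the .Rproj file, since RStudio creates that itself.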
What about the
project_name.Rproj file? We’ll turn to that now.
R Project Files and the here Package
The main problem which R project files and the here package solve is file
paths.
There are two options for starting a new project in RStudio: starting a project in an already existing directory or creating a new directory. Both options will be offered if you open RStudio, click the ‘project’ button at the top right of your screen, and select ‘New Project…’. Either option is perfectly fine.
Once you have created a project you should see the name you have chosen at the
top right of the screen. Your project folder, whether it already existed or you
just created it, will also have a file with your project name and the extension
.Rproj. This is the R project file. You can use this to open your project
(just double click on it).
Once the project is opened, your working directory will be the project folder.
This is already an improvement on getting and setting working directories
with absolute file paths at the start of an R script. Anyone you send the
project to will, if they open through the project file, be in the right place
on their own machine to run your code. If you use relative file paths, this
is a big improvement. That is, if you use
read_csv('data/my_data.csv') from an R script in your project folder,
this will work for anyone who you share your project with.
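The point about relative paths can be checked with a small sketch. It builds a throwaway project in a temporary folder (the folder name, file contents, and column names are all made up) and uses base R's read.csv() as a stand-in for read_csv() so that no extra packages are needed. The call to setwd() is only there to imitate what opening the project through its .Rproj file does for you.

```r
# Hypothetical project folder in a temporary directory (illustration only).
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"),
           recursive = TRUE, showWarnings = FALSE)

# Made-up data file inside the project's data folder.
write.csv(
  data.frame(word = c("cat", "dog"), duration_ms = c(310, 295)),
  file.path(project_dir, "data", "my_data.csv"),
  row.names = FALSE
)

# Imitate opening the project: the working directory becomes the project folder.
old_wd <- setwd(project_dir)

# The relative path is resolved against the working directory, so this line
# does not mention anything specific to one person's computer.
my_data <- read.csv("data/my_data.csv")

# Restore the previous working directory.
setwd(old_wd)

my_data
```

The only line that would appear in a shared script is the read.csv() call; everything else is scaffolding for the demonstration.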
The here package takes us even further, though.
As Jenny Richmond sets out,
working directories work a little differently for R scripts and R markdown files.
If you have a script and an Rmd file in
project_name/markdown, the working
directory for the R script will be the project folder, while the working directory
for the R markdown file will be
project_name/markdown. This means, for instance,
that code which works in your markdown file will not work in the R console. It’s
also something which you would have to keep in your head when moving between R
scripts and R markdown files. This is not the kind of thing you want to be
thinking about.
If you load the package
here when working in a project, you can refer to files
within your project in a consistent way which does not depend on anything
specific to your own computer. For example:
    library(here)
    library(readr)  # provides read_csv()

    my_data <- read_csv(here('data', 'my_data.csv'))
This code will load the package
here, which looks ‘up’ the directory structure
until it finds the R project file. We then use the function
here to refer to
files within our project. The arguments to
here are either folders or files.
A number of arguments specifying folders are followed by an argument with the
name of the specific file to be loaded. In this case,
'data' is the first
argument, which says we are looking in the data folder inside our project
folder. The second argument,
'my_data.csv', says that we are looking for a file
with that name inside the folder specified by the previous argument.
To belabour the point a bit: if the file we want to load is two folders deep,
say, the file my_data.csv is inside a folder called data_source_1 which is
in turn inside our data folder, then we would use
here('data', 'data_source_1', 'my_data.csv').
You can have as many arguments as you like when using the here function.
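To see what the here function actually returns for a nested file, the following sketch builds the path without reading anything. It assumes the here package is installed; the folder and file names are the illustrative ones from the text, and the files do not need to exist for the path to be constructed.

```r
# Assumes the here package is installed. If no project file is found, here
# falls back on the current working directory as the root.
library(here)

# One argument per folder, then the file name.
nested_path <- here('data', 'data_source_1', 'my_data.csv')

# The folders can also be written as a single relative path; this should
# point at the same file.
same_path <- here('data/data_source_1/my_data.csv')

# Print the full path built from the project root.
nested_path

# Compare the two ways of writing the path.
identical(nested_path, same_path)
```

Passing each folder as its own argument keeps path separators out of the code entirely.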
A final note: in a footnote to the
here package documentation it is suggested
that you avoid library(here) and instead use
here::here() whenever you use the
here function. I can see that this avoids clashes with the
plyr package, but I don’t see this as likely to come up in any of my
work or in NZILBB projects.
There are two main code sharing scenarios:
- Sharing a whole project.
- Sharing a small portion of a project.
In the former case, the best option is Git, along with either GitHub or GitLab. This does have a bit of a learning curve (but it’s not insurmountable)!
Otherwise, for either case, use whatever method you’d usually use to share files with someone. Either send the whole project folder, or the project folder without unnecessary files. Sometimes, it might be easier to start a new project and create a minimal working example of whatever it is you want to share.
- Bodo Winter, Statistics for Linguists: An Introduction Using R
  - See Chapter 2, which covers the core tidyverse packages with a worked example from linguistics and introduces, in Section 2.8, a folder structure similar to the one suggested here. Winter doesn’t cover the here package.
- Hadley Wickham and Garrett Grolemund, R for Data Science, Chapter 8, ‘Workflow: projects’.
  - Some advice for setting up RStudio to encourage good habits and reasons for using R projects to organise code.
- Jenny Richmond, ‘how to use the here package’
- Malcolm Barrett, ‘Why should I use the here package when I’m already using projects?’
If you want to know more about Git and GitHub for code sharing and version management, check out Happy Git and GitHub for the useR.