# Suggested Workflow for Storing a Variable Set of Dataframes under Version Control

## Introduction

This vignette describes a suggested workflow for storing a snapshot of dataframes as git2rdata objects under version control. The workflow comes in two flavours:

1. A single repository holding both the data and the analysis code. The single repository set-up is simple. A single reference (e.g. commit) points to both the data and the code.
2. One repository holding the data and a second repository holding the code. The data and the code have an independent history under a two repository set-up. Documenting the analysis requires one reference to each repository. Such a set-up is useful for repeating the same analysis (stable code) on updated data.

In this vignette we use a git2r::repository() object as the root. This adds the git functionality provided by the git2r package to write_vc() and read_vc(), which allows us to pull, stage, commit and push from within R.

Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. The difference between two commits can consist of changes in existing git2rdata objects (updated observations, new observations, deleted observations or updated metadata). Besides updating the existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. We need to track such higher-level additions and deletions as well.
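
A minimal sketch of this bookkeeping, assuming the repo object created in the Setup section below and a hypothetical data subdirectory holding the git2rdata objects: list_data() shows which objects are available, rm_data() drops their data (.tsv) files before the current snapshot is rewritten, and prune_meta() removes the metadata (.yml) files of objects that were not rewritten, so their deletion becomes part of the commit.

# a sketch only: `path` and `repo` are created in the Setup section below,
# and "data" is a hypothetical directory holding the git2rdata objects
library(git2rdata)
repo <- repository(path)
list_data(repo)                               # which objects are available?
rm_data(repo, path = "data", stage = TRUE)    # drop the data (.tsv) files
# ... rewrite the current snapshot with write_vc() ...
prune_meta(repo, path = "data", stage = TRUE) # drop orphaned metadata (.yml)
commit(repo, message = "updated snapshot")    # deletions are part of the commit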

We illustrate the workflow with a mock analysis on the datasets::beaver1 and datasets::beaver2 datasets.

## Setup

We start by initializing a git repository; git2rdata assumes this has already been done. We use the git2r functions to do so, first creating a local bare repository to act as the remote. In practice the remote would live on an external server (GitHub, GitLab, Bitbucket, …). The example below creates a local git repository with an upstream git repository. Any other workflow that creates a similar structure is fine.

# initialize a bare git repo to be used as remote
remote <- tempfile("git2rdata-workflow-remote")
remote <- normalizePath(remote, winslash = "/")
#> Warning in normalizePath(remote, winslash = "/"): path[1]="/tmp/Rtmp5wrPiD/
#> git2rdata-workflow-remote151b553b1f25": No such file or directory
dir.create(remote)
git2r::init(remote, bare = TRUE)

# initialize a local git repo
path <- tempfile("git2rdata-workflow")
path <- normalizePath(path, winslash = "/")
#> Warning in normalizePath(path, winslash = "/"): path[1]="/tmp/Rtmp5wrPiD/git2rdata-
#> workflow151b3b10f81b": No such file or directory
dir.create(path)
init_repo <- git2r::clone(remote, path, progress = FALSE)
git2r::config(init_repo, user.name = "me", user.email = "me@me.com")
# add an initial commit with a .gitignore file
writeLines("*extra*", file.path(path, ".gitignore"))
git2r::add(init_repo, ".gitignore")
git2r::commit(init_repo, message = "Initial commit")
#> [12615d9] 2021-01-20: Initial commit
# push initial commit to remote
git2r::push(
  init_repo, "origin",
  paste0("refs/heads/", git2r::repository_head(init_repo)$name)
)
rm(init_repo)

## Structuring Git2rdata Objects Within a Project

git2rdata imposes a minimal structure. Both the .tsv and the .yml file need to be in the same folder. That’s it. For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository.

This might not be a good idea for a real project. We recommend using at least a separate directory tree for each import script. This directory can go into the root of a data-only repository, into the data directory of a combined data and code repository, or into the inst directory of an R package.

Your project might need a different layout. Feel free to choose the directory structure that is most relevant for your project.
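
As an illustration only (the directory and object names below are hypothetical and not part of the running example), an import script in a combined data and code repository could write its objects to its own subdirectory of data:

# hypothetical layout: one subdirectory of "data" per import script, e.g.
#   data/import_beaver/beaver.tsv   data/import_beaver/beaver.yml
#   data/import_other/other.tsv     data/import_other/other.yml
library(git2rdata)
project_root <- tempfile("git2rdata-layout")  # stand-in for a real project
dir.create(file.path(project_root, "data", "import_beaver"), recursive = TRUE)
write_vc(
  beaver1, file = file.path("data", "import_beaver", "beaver"),
  root = project_root, sorting = "time"
)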

## Storing Dataframes Ad Hoc into a Git Repository

### First Commit

In the first commit we use datasets::beaver1. We connect to the git repository using repository(); note that this assumes path is an existing git repository. Now we can write the dataset as a git2rdata object in the repository. If the root argument of write_vc() is a git_repository, write_vc() gains two extra arguments: stage and force. Setting stage = TRUE automatically stages the files written by write_vc().

library(git2rdata)
repo <- repository(path)
fn <- write_vc(beaver1, "beaver", repo, sorting = "time", stage = TRUE)

We can use status() to check that write_vc() wrote and staged the required files. Then we commit() the changes.

status(repo)
#> Staged changes:
#>  New:        beaver.tsv
#>  New:        beaver.yml
cm1 <- commit(repo, message = "First commit")
  beaver2$beaver <- 2
  body_temp <- rbind(beaver1, beaver2)
  fn <- write_vc(
    x = body_temp, file = file.path(data_path, "body_temperature"),
    root = repo, sorting = c("beaver", "time"), stage = TRUE
  )
  # step 4: remove any dangling metadata files
  prune_meta(root = repo, path = data_path, stage = TRUE)
  # step 5: commit the changes
  cm <- commit(repo = repo, message = "import", session = TRUE)
  # step 5b: sync the repository with the remote
  push(repo)
}

## Analysis Workflow with Reproducible Data

The example below is a small, trivial, standardized analysis that documents the source of the data by recording the name of the dataset, the repository URL and the commit. We can use this information when reporting the results, which makes the data underlying the results traceable.

analysis <- function(ds_name, repo) {
  ds <- read_vc(ds_name, repo)
  list(
    dataset = ds_name,
    repository = git2r::remote_url(repo),
    commit = recent_commit(ds_name, repo, data = TRUE),
    model = lm(temp ~ activ, data = ds)
  )
}
report <- function(x) {
  knitr::kable(
    coef(summary(x$model)),
    caption = sprintf(
      "**dataset:** %s  \n**commit:** %s  \n**repository:** %s",
      x$dataset, x$commit$commit, x$repository
    )
  )
}

In this case we can run every analysis by looping over the list of datasets in the repository.

repo <- repository(path)
current <- lapply(list_data(repo), analysis, repo = repo)
names(current) <- list_data(repo)
result <- lapply(current, report)
junk <- lapply(result, print)

**dataset:** beaver
**commit:** 3863f2ec468e5a5d4613cfe8d0aab62b87502df0
**repository:** /tmp/Rtmp5wrPiD/git2rdata-workflow-remote151b553b1f25

|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | 36.9084247 |  0.0198546 | 1858.93938 |          0 |
| activ       |  0.9346636 |  0.0352219 |   26.53644 |          0 |

The example below does the same thing for the first and second commit.

# checkout first commit
git2r::checkout(cm1)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)

**dataset:** beaver
**commit:** f1bd46a8d3d13361a2cdf1ff21e52431e90bfb51
**repository:** /tmp/Rtmp5wrPiD/git2rdata-workflow-remote151b553b1f25

|             |   Estimate | Std. Error |     t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|------------:|-----------:|
| (Intercept) | 36.8421296 |  0.0167694 | 2196.987569 |      0e+00 |
| activ       |  0.3812037 |  0.0730961 |    5.215107 |      8e-07 |
# checkout second commit
git2r::checkout(cm2)
# do analysis
previous <- lapply(list_data(repo), analysis, repo = repo)
names(previous) <- list_data(repo)
result <- lapply(previous, report)
junk <- lapply(result, print)

**dataset:** beaver
**commit:** f1bd46a8d3d13361a2cdf1ff21e52431e90bfb51
**repository:** /tmp/Rtmp5wrPiD/git2rdata-workflow-remote151b553b1f25

|             |   Estimate | Std. Error |     t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|------------:|-----------:|
| (Intercept) | 36.8421296 |  0.0167694 | 2196.987569 |      0e+00 |
| activ       |  0.3812037 |  0.0730961 |    5.215107 |      8e-07 |

**dataset:** extra_beaver
**commit:** 16dc4532b8cf0b6869e83ab9c4f8b35194c123ac
**repository:** /tmp/Rtmp5wrPiD/git2rdata-workflow-remote151b553b1f25

|             |   Estimate | Std. Error |    t value | Pr(>\|t\|) |
|:------------|-----------:|-----------:|-----------:|-----------:|
| (Intercept) | 37.0968421 |  0.0345624 | 1073.32955 |          0 |
| activ       |  0.8062224 |  0.0438943 |   18.36736 |          0 |

If you inspect the reported results, you’ll notice that all the output (coefficients and commit hash) for the “beaver” object is identical for the first and second commit. This makes sense, since the “beaver” object didn’t change during the second commit. The output for the current (third) commit is different because the dataset changed.
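
Assuming the previous and current objects from the chunks above are still available, we can make the comparison explicit by checking the data commit hashes recorded with each analysis:

# `previous` holds the analyses for the second commit, `current` those for
# the current (third) commit; the hashes differ, so the "beaver" data changed
previous$beaver$commit$commit == current$beaver$commit$commit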

### Long-Running Analysis

Imagine the case where an individual analysis takes a while to run. We store the most recent version of each analysis together with the information from recent_commit(). When preparing a new run, you can call recent_commit() again on the dataset and compare the commit hash with the one stored with the available analysis. If the commit hashes match, the data hasn’t changed and there is no need to rerun the analysis1, saving valuable computing resources and time.
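
A minimal sketch of this idea, reusing the analysis() function defined above and assuming a hypothetical cache directory in which every fitted analysis is stored with saveRDS():

cached_analysis <- function(ds_name, repo, cache_dir = "cache") {
  cache_file <- file.path(cache_dir, paste0(ds_name, ".rds"))
  # hash of the most recent commit that changed this dataset
  current_hash <- recent_commit(ds_name, repo, data = TRUE)$commit
  if (file.exists(cache_file)) {
    stored <- readRDS(cache_file)
    if (stored$commit$commit == current_hash) {
      # same data commit: reuse the stored analysis instead of refitting
      return(stored)
    }
  }
  # no cached version yet, or the data changed: rerun and store the analysis
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  result <- analysis(ds_name, repo)
  saveRDS(result, cache_file)
  result
}

Calling cached_analysis() twice in a row on an unchanged dataset then fits the model only once.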

1. assuming the code for running the analysis didn’t change.↩︎