Aggregating Average Treatment Effects with Baggr

2019-07-29

This vignette is - like the package itself - still under construction. We encourage your feedback.

baggr (pronounced “bagger” or “badger” and short for Bayesian Aggregator) is a package for aggregating evidence on causal effects measured in several separate and different instances. These instances may be different studies, groups, locations or “sites”, however conceptualised. We refer to these separate pieces of evidence as “groups” for the remainder of this vignette. When each group is a study, the model is that of meta-analysis, but aggregation of evidence is not limited to this case.

One of the most basic objects of interest is the average treatment effect, the difference in mean outcomes between the treatment and control groups; for more information see Rubin (1974). In meta-analysis we are often interested in the average of this average effect across groups, estimated using all the evidence from all the groups. Consider the case where the evidence in each study or group is generated by comparing the outcomes of treatment and control samples in a randomized experiment. We will ignore any covariate information at the individual or group level for now.

Consider some outcome of interest $$y_{ik}$$, such as consumption, income or health outcomes for a household or individual $$i = 1,2,\dots,N_k$$ in study group $$k = 1,2,\dots,K$$. Let $$Y_k$$ denote the $$N_k$$-length vector of observed outcomes from group $$k$$. Denote the binary indicator of treatment status by $$T_{ik}$$, and denote by $$T_k$$ the $$N_k$$-length vector of all treatment status indicators from group $$k$$.

Suppose that $$y_{ik}$$ varies randomly around its mean $$\mu_k + \tau_k T_{ik}$$. In this setting $$\tau_k$$ is the treatment effect in group $$k$$. The random variation in $$y_{ik}$$ may be the result of sampling variation or measurement error, as in the Rubin (1981) model, or it may be the result of unmodelled heterogeneity or uncertainty in outcomes for individuals within the group. Allow the variance of the outcome variable $$y_{ik}$$ to vary across groups, so that $$\sigma_{y_k}^2$$ may differ across $$k$$.
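Written out, and without committing to a particular distribution for the outcome, the setting just described amounts to the following moment conditions (a restatement of the assumptions above, not an additional model):

\begin{aligned} \mathbb{E}[y_{ik}] &= \mu_k + \tau_k T_{ik} \; \forall \; i, k \\ \text{Var}(y_{ik}) &= \sigma_{y_k}^2 \; \forall \; i, k . \end{aligned}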

Data inputs: reported effects or full individual-level data sets

For average effects aggregation, baggr allows three types of data inputs. The user may supply, within a data frame, any of the following (a minimal example of the first format appears after this list):

1. A set of estimated treatment effects $$\{\hat{\tau_k}\}_{k=1}^{K}$$ and their standard errors $$\{\hat{se_k}\}_{k=1}^{K}$$ from each study. This should be formatted as two column vectors of length $$K$$ within the data frame, where $$\hat{\tau_k}$$ is the $$k$$-th entry of the treatment effect vector and $$\hat{se_k}$$ is the $$k$$-th entry of the standard errors vector. Columns should be named “tau” and “se”.

2. A set of control group means and estimated treatment effects $$\{\hat{\mu_k},\hat{\tau_k}\}_{k=1}^{K}$$, as well as the standard errors for both $$\{\hat{se}_{\mu k}, \hat{se}_{\tau k}\}_{k=1}^{K}$$, for each study site. This should be formatted as four column vectors of length $$K$$ within the data frame, analogous to the above. Columns should be named “mu”, “tau”, “se.mu” and “se.tau”.

3. The full data sets from all the original studies $$\{Y_k, T_k\}_{k=1}^{K}$$. This should be formatted as three column vectors of length $$\sum_{k=1}^K N_{k}$$, which we recommend naming “outcome”, “treatment” and “group” (for site indicators), but the names can also be specified when calling the baggr() function.
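As a minimal sketch of the first input format, the data frame below is constructed by hand, with made-up numbers purely for illustration:

# Hypothetical summary data for three groups: estimated treatment effects
# ("tau") and their standard errors ("se"), in the column layout baggr expects
df_summary <- data.frame(tau = c(0.5, 1.2, -0.3),
                         se  = c(0.2, 0.4,  0.25))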

As examples of individual-level data sets, we include the data frames microcredit and microcredit_simplified. The former contains all the microcredit outcome data used in Meager (2019), standardized to USD PPP in 2009 terms per two weeks (a time period is necessary as these are flow variables). It therefore contains NAs and other undesirable features, allowing the user to see how baggr handles these common data issues. The data set microcredit_simplified has these issues cleaned up and contains only one outcome of interest, consumer durables spending.

baggr also has a function that automatically produces summary data from full data sets, in case one wishes to run the comparatively faster summary-data models. The prepare_ma() function, applied to a data frame with columns named “group”, “outcome” and “treatment”, automatically estimates the control group means, treatment effects and associated standard errors for each group using an Ordinary Least Squares regression. The resulting output is already formatted as a valid input to the baggr() command itself:

colnames(microcredit_simplified) <- c("group", "outcome", "treatment")
prepare_ma(microcredit_simplified)
#>   group          mu       tau       se.mu     se.tau
#> 1     1    5.369612  1.074862   0.3226702   0.425339
#> 2     2 1156.243056 -6.503679 146.2217092 191.120776
#> 3     3   23.868992  4.440776   1.2379155   2.207257
#> 4     4    6.516936  1.377748   1.0043951   2.271460

ATE aggregation models in baggr

baggr currently contains two different models suitable for aggregating sets of average treatment effects. Consider first the evidence aggregation model from Rubin (1981), discussed extensively in Chapter 5 of Gelman et al. (2013); the model consists of a hierarchical likelihood as follows:

\begin{aligned} \hat{\tau_k} &\sim N(\tau_k, \hat{se_k}^2) \; \forall \; k \\ \tau_k &\sim N(\tau, \sigma_{\tau}^2) \; \forall \; k . \end{aligned} \label{rubin model}

The motivation for this model structure is discussed in detail in the sources above and in Meager (2019). To complete the Bayesian model, we now need priors. baggr has a set of default priors for each model and also allows the user to specify her own priors if desired. In the Rubin model, baggr’s default priors on the hyper-parameters are as follows: for $$\tau$$, the prior is Normal with mean 0 and variance 1000. This is a very weak prior which does little regularization as a default, centered at zero following the basic philosophical approach that causal effects should not be thought of as large unless the data contain evidence to the contrary (a “hand-wavey” form of Occam’s Razor). For $$\sigma_{\tau}$$ the prior is Uniform on an interval from zero to 10 times the simple or naive variance estimator of the vector $$\{\hat{\tau_k}\}_{k=1}^{K}$$ generated by the R command var().
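Written out, the defaults just described are (with $$\widehat{\text{var}}$$ denoting the naive variance estimator of the vector of reported effects):

\begin{aligned} \tau &\sim N(0, 1000) \\ \sigma_{\tau} &\sim \text{Uniform}\left(0, \; 10 \cdot \widehat{\text{var}}\big(\{\hat{\tau_k}\}_{k=1}^{K}\big)\right) . \end{aligned}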

In case you also have data on the control groups’ mean outcomes and the uncertainty on those, it would make sense to augment the Rubin (1981) model to incorporate that information. Following Meager (2019), if one has access to the estimated control means $$\{\hat{\mu_k}\}^K_{k=1}$$ and their standard errors $$\{\hat{se}_{\mu k}\}^K_{k=1}$$, one can fit a jointly Normal model on the pairs $$\{\hat{\mu_k},\hat{\tau_k}\}_{k=1}^{K}$$. baggr implements the model from Meager (2019) which it calls the “mutau” model, like so:

\begin{aligned} \hat{\tau_k} &\sim N(\tau_k, \hat{se_{\tau k}}^2) \; \forall \; k \\ \hat{\mu_k} &\sim N(\mu_k, \hat{se_{\mu k}}^2) \; \forall \; k \\ \left( \begin{array}{c} \mu_{k}\\ \tau_{k} \end{array} \right) &\sim N\left( \left( \begin{array}{c} \mu\\ \tau \end{array} \right), V \right) \; \text{where} \;V = \left[ \begin{array}{cc} \sigma^2_{\mu} & \sigma_{\tau\mu} \\ \sigma_{\tau\mu} & \sigma_{\tau}^2 \end{array} \right]\forall \; k. \\ \end{aligned} \label{full data model}

If you have only a few groups, the priors on $$V$$ will need to be relatively strong to avoid overfitting. See Meager (2019) for more discussion of this issue in particular, or see the Stan manual on hierarchical priors. Currently $$V$$ is decomposed into a correlation matrix $$\Omega$$, which receives an LKJCorr(3) prior regularizing it towards independence, and a scale parameter $$\theta$$, which receives a Cauchy(0,10) prior. This will be something the user can change in later versions of baggr. The $$\mu$$ and $$\tau$$ parameters have a joint Gaussian prior which the user can specify, but which by default is centered at 0. The centering can be changed to any vector of length 2 by passing this vector to baggr labelled as “prior_tau_mean”. The default covariance matrix of this prior is independent with variance $$1000^2$$, but the user can pass any prior covariance matrix to baggr labelled as “prior_tau_scale”.
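As an illustration, a call along the following lines would fit the “mutau” model to the prepared microcredit summary data while setting these two priors explicitly. This is only a sketch: the argument names are taken from the description above, so consult ?baggr for the exact interface of the version you have installed.

# Sketch only: summary data prepared earlier with prepare_ma(), columns renamed as above
df_microcredit <- prepare_ma(microcredit_simplified)
fit_mutau <- baggr(df_microcredit, model = "mutau",
                   prior_tau_mean  = c(0, 0),          # centre of the joint Normal prior on (mu, tau)
                   prior_tau_scale = 1000^2 * diag(2)) # independent components, variance 1000^2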

Models, their inputs, likelihood and priors: a summary table

| Model | Data frame input columns | Level-1 likelihood | Level-2 likelihood | Priors |
|---|---|---|---|---|
| “Rubin” | tau, se | $$\hat{\tau_k} \sim N(\tau_k, \hat{se_k}^2)$$ | $$\tau_k \sim N(\tau, \sigma_{\tau}^2)$$ | $$\tau \sim \mathcal{N}(0, \sigma^2)$$, $$\sigma_{\tau} \sim \mathcal{U}(0, u)$$ |
| “$$\mu$$ and $$\tau$$” | tau, mu, se.tau, se.mu | $$\hat{\tau_k} \sim N(\tau_k, \hat{se}_{\tau k}^2)$$, $$\hat{\mu_k} \sim N(\mu_k, \hat{se}_{\mu k}^2)$$ | $$\begin{pmatrix} \mu_k \\ \tau_k \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ \tau \end{pmatrix}, V \right)$$ | $$V = \theta \Omega \theta'$$ where $$\theta \sim \text{Cauchy}(0,10)$$, $$\Omega \sim \text{LKJCorr}(3)$$, $$\begin{pmatrix} \mu \\ \tau \end{pmatrix} \sim N(0, \Xi)$$ |
| Full data | outcome, treatment, group | Same as for “$$\mu$$ and $$\tau$$” | Same as for “$$\mu$$ and $$\tau$$” | Same as for “$$\mu$$ and $$\tau$$” |
| Quantiles | outcome, treatment, group | See Meager (2019) | See Meager (2019) | See Meager (2019) |

Running the Rubin Model in baggr

To demonstrate the Rubin model in baggr, consider the 8 schools example from Rubin (1981). In this dataset, 8 schools in the United States performed similar randomized experiments to estimate the causal effect of an SAT tutoring program on learning outcomes. Reported treatment effects and their standard errors are included in baggr as a data frame:

schools
#>      group tau se
#> 1 School A  28 15
#> 2 School B   8 10
#> 3 School C  -3 16
#> 4 School D   7 11
#> 5 School E  -1  9
#> 6 School F   1 11
#> 7 School G  18 10
#> 8 School H  12 18

To fit the model in baggr (having followed the installation instructions and loaded the package):

baggr_schools <- baggr(schools, model = "rubin", pooling = "partial")
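The returned object can then be inspected with the usual methods; for example, printing it summarises the hyper-parameters and the group-level effects (a plot method is also available; see the package documentation for the full list of methods):

baggr_schools          # print a summary of the fitted model
# plot(baggr_schools)  # graphical comparison of group-level and pooled estimates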

Cross-validation in baggr

[Automated cross-validation will be included in the next package release. For this version only the “manual” method is available, as presented at the end of this section.]

baggr has built-in, automated leave-one-out cross-validation for its models, where the “one” always refers to one group. This naturally corresponds to the question of how much any one group has contributed to the overall inference. Therefore, if you have $$K$$ groups, using loocv() will run your model of choice $$K$$ times. Be aware that this may take a while even for simple models. (You should see a progress bar in your terminal window.) The loocv() function takes the same arguments as baggr(), plus an option (return_models) for whether to return all the models or just the summary statistics. For the 8 schools example we can do:

loocv_res <- loocv(schools, return_models = FALSE, "rubin", pooling = "partial")
loocv_res
#> [1] "log predictive density =  63.51933"

The main output is $$-2$$ times the log predictive density (lpd) summed over 8 models, which corresponds to the Watanabe-Akaike Information Criterion. A WAIC value closer to zero (i.e. a smaller number in magnitude) means a better fit.

More data are stored in the loocv() output and can be accessed via attributes(); for example, the mean treatment effects, their variability and the lpd for each model are stored in the attribute df:

names(attributes(loocv_res))
#> [1] "names" "class"
attr(loocv_res, "df")
#> NULL

This data frame can then be used to examine or compute the variation in the inference on $$\tau$$ in the absence of each group. If the user is interested in manually checking the consequences of excluding a particular group or set of groups, this is also possible in baggr using subsetting. For example, suppose that we want to run the Rubin model on school groups 1-7 and predict the effect in the 8th school. The code below shows how you can specify a subset of the dataframe as your “data” argument, and then designate another subset as the “testing” holdout set by assigning it to the argument “test_data” in the baggr command. Here we have done it for both partial and full pooling:

fit1 <- baggr(data = schools[1:7,], test_data = schools[8,], model = "rubin", pooling = "partial")
fit2 <- baggr(data = schools[1:7,], test_data = schools[8,], model = "rubin", pooling = "full")

We can compare the performance of the two models using the mean log predictive density on the held-out school. This is itself a density, as the name suggests, so we compute the log of its expected value; here, as before, a number closer to zero is better. In this case the full pooling model actually does slightly better:

fit1$mean_lpd
#> [1] 7.942566
fit2$mean_lpd
#> [1] 7.744134

References

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. CRC Press.

Gelman, Andrew, and Iain Pardoe. 2006. “Bayesian Measures of Explained Variance and Pooling in Multilevel (Hierarchical) Models.” Technometrics 48 (2): 241–51. https://doi.org/10.1198/004017005000000517.

Meager, Rachael. 2019. “Aggregating Distributional Treatment Effects: A Bayesian Hierarchical Analysis of the Microcredit Literature.” https://doi.org/10.31222/osf.io/7tkvm.

Rubin, Donald B. 1974. “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.” Journal of Educational Psychology. https://doi.org/10.1037/h0037350.

———. 1981. “Estimation in Parallel Randomized Experiments.” Journal of Educational Statistics 6 (4): 377–401.