Friday, September 2, 2016

Incorporating previous study results in your model

Many researchers appear to want to use Bayesian analysis when analysing their experiments, but many do not actually carry it out, possibly because of the perception that doing so is difficult. There is nothing wrong with that. It is, after all, possible to have a Bayesian outlook on the world—updating your priors about how the world works when new evidence comes to light—without carrying out explicit Bayesian modeling. But it would be better if analysts used tools consistent with their beliefs.
In this post I’ll show you how to estimate an experimental treatment effect using linear regression, while incorporating prior information from previous studies. Rather than doing this in stages (estimating the treatment effect and then doing some meta-analysis), we’ll do everything in one pass. This has the advantage of helping us to get more precise estimates of all our model parameters.

A very basic underlying model

Let’s say that we run the $(J+1)$’th experiment estimating the effect of some treatment $T$ on an outcome $y$. It’s an expensive and ethically challenging experiment to run, so unfortunately we’re only able to get a sample size of 20. For simplicity, we can assume that the treatment has the same effect, $\theta_{J+1}$, for all people (this assumption is easily dropped in more elaborate analyses). There have been $J$ similar experiments run in the past. In this example our outcome data $y$ are conditionally normally distributed with (untreated) mean $\mu$ and standard deviation $\sigma$. There is nothing to stop us from having a far more complex model for the data. So the outcome model looks like this:

$$y_{i} \sim \mathrm{Normal}(\mu + \theta_{J+1}\,\mathrm{treatment}_{i},\ \sigma)$$
The question is: how can we estimate the parameters of this model while taking account of the information from the $J$ previous studies? The answer is to use the so-called hierarchical prior.

The hierarchical prior

Let’s say that each of the $J$ previous studies has an estimated treatment effect $\beta_{j}$, estimated with some standard error $se_{j}$. Taken together, are these estimates $\beta_{j}$ the ground truth for the true treatment effect in their respective studies? One way of answering this is to ask ourselves: if the researchers of each of those studies replicated their study in precisely the same way, but after seeing the estimates obtained by the other researchers, would they expect to find the same estimate they found before, $\beta_{j}$? Or would they perhaps expect some other treatment effect estimate, $\theta_{j}$, that balances the information from their own study with the other studies?
The answer to this question gives rise to the hierarchical prior. In this prior, we say that the estimated treatment effect $\beta_{j}$ is a noisy measure of the underlying treatment effect $\theta_{j}$ for each study $j$. These underlying effects are in turn noisy estimates of the true average treatment effect $\hat{\theta}$—noisy because of uncontrolled-for variation across experiments. That is, if we make assumptions of normality:

$$\beta_{j} \sim \mathrm{Normal}(\theta_{j},\ se_{j})$$

and

$$\theta_{j} \sim \mathrm{Normal}(\hat{\theta},\ \tau)$$

where $\tau$ is the standard deviation of the distribution of plausible experimental estimates.
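This structure is what lets the model “borrow” information across studies: with normal distributions throughout, the conditional posterior mean of each $\theta_{j}$ given $\hat{\theta}$ and $\tau$ is a precision-weighted average of the study’s own estimate and the across-study mean,

$$E[\theta_{j} \mid \beta_{j}, \hat{\theta}, \tau] = \frac{\beta_{j}/se_{j}^{2} + \hat{\theta}/\tau^{2}}{1/se_{j}^{2} + 1/\tau^{2}}$$

so noisier studies (large $se_{j}$) are shrunk more strongly towards $\hat{\theta}$, while a small $\tau$ pulls all studies towards the common mean.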
The analysis therefore has the following steps:
  • Build a model of the treatment effects, treating our own study as just another data point.
  • Jointly estimate the hyperdistribution of treatment effects along with our regression parameters (the full model is written out below).
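Putting the outcome model and the hierarchical prior together, the joint model we will estimate is

$$\beta_{j} \sim \mathrm{Normal}(\theta_{j},\ se_{j}), \quad j = 1, \dots, J$$

$$y_{i} \sim \mathrm{Normal}(\mu + \theta_{J+1}\,\mathrm{treatment}_{i},\ \sigma), \quad i = 1, \dots, N$$

$$\theta_{j} \sim \mathrm{Normal}(\hat{\theta},\ \tau), \quad j = 1, \dots, J+1$$

so the treatment effect in our own study, $\theta_{J+1}$, is drawn from the same hyperdistribution as the effects from the previous studies.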
As an example, we’ll take the original 8-schools data, with some fake data for the experiment we want to estimate. The 8-schools example comes from an education intervention modeled by Rubin, in which a similar experiment was conducted in 8 schools, with only treatment effects and their standard errors reported. The task is to generate an estimate of the possible treatment effect that we might expect if we were to roll out the program across all schools.
library(rstan); library(dplyr); library(ggplot2); library(reshape2)

# The original 8 schools data
schools_dat <- data_frame(beta = c(28,  8, -3,  7, -1,  1, 18, 12),
                          se = c(15, 10, 16, 11,  9, 11, 10, 18))

# The known parameters of our data generating process for fake data
mu <- 10
sigma <- 5
N <- 20
# Our fake treatment effect estimate drawn from the posterior of the 8 schools example
theta_J <- rnorm(1, 8, 6.45) 

# Create some fake data
treatment <- sample(0:1, N, replace = T)
y <- rnorm(N, mu + theta_J*treatment, sigma)
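As a quick sanity check on the fake data, the raw difference in means between treated and untreated units is the naive estimate of the treatment effect; with only 20 observations and sigma = 5 it can easily be a couple of units away from theta_J:

# Naive estimate of the treatment effect from the 20 fake observations
mean(y[treatment == 1]) - mean(y[treatment == 0])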
The Stan program we use to estimate the model is below. Note that these models can be difficult to fit, and so we employ a “reparameterization” of the thetas. This is achieved by noticing that if

$$\theta_{j} \sim \mathrm{Normal}(\hat{\theta},\ \tau)$$

then we can write

$$\theta_{j} = \hat{\theta} + \tau\,\eta_{j}$$

where $\eta_{j} \sim \mathrm{Normal}(0, 1)$. A standard normal has an easier geometry for Stan to work with, so this parameterization of the model is typically preferred. Here is the Stan model:
// We save this as 8_schools_w_regression.stan
data {
  int<lower=0> J; // number of schools 
  int N; // number of observations in the regression problem
  real beta[J]; // estimated treatment effects from previous studies
  real<lower=0> se[J]; // s.e. of those effect estimates 
  vector[N] y; // the outcomes for students in our fake study data
  vector[N] treatment; // the treatment indicator in our fake study data
}
parameters {
  real mu;               // untreated mean outcome in our study
  real<lower=0> tau;     // sd of the distribution of study-level effects
  real eta[J+1];         // standardized study-level effects (non-centered)
  real<lower=0> sigma;   // residual sd of the outcome in our study
  real theta_hat;        // average treatment effect across studies
}
transformed parameters {
  real theta[J+1];
  for (j in 1:(J+1)){
    theta[j] = theta_hat + tau * eta[j];
  }
}
model {
  // priors
  tau ~ cauchy(5, 2);
  mu ~ normal(10, 2);
  eta ~ normal(0, 1);
  sigma ~ cauchy(3, 3);
  theta_hat ~ normal(8, 5);
  
  // parameter model for previous studies
  for(j in 1:J) {
    beta[j] ~ normal(theta[j], se[j]);
  }
  
  // our regression
  y ~ normal(mu + theta[J+1]*treatment, sigma);
  
}
Now we estimate the model from R. Because of the geometry issues mentioned above, we use control = list(adapt_delta = 0.99) to prompt Stan to take smaller step sizes, improving sampling performance at a cost of slower estimation time (this isn’t a problem here; it estimates in a couple of seconds).
eight_schools_plus_regression <- stan("8_schools_w_regression.stan",
                       data = list(beta = schools_dat$beta,
                                   se = schools_dat$se,
                                   J = 8,
                                   y = y,
                                   N = N,
                                   treatment = treatment),
                       control = list(adapt_delta = 0.99))
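For a frequentist point of comparison, we can fit the same regression by ordinary least squares using only our 20 fake observations; a minimal version of that call looks like this (the intercept plays the role of mu and the treatment coefficient the role of theta[9]):

# OLS benchmark using only the new study's data
ols_fit <- lm(y ~ treatment)
cbind(estimate = coef(ols_fit), confint(ols_fit))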
Let’s compare the estimates from this simple linear regression with those from the Bayesian model. The linear regression gives us the following confidence intervals for the parameter estimates:
coef       estimate    2.5 %      97.5 %
mu         10.114894   6.877282   13.35251
theta[9]    9.201593   4.835998   13.56719
Our Bayesian model gives us more precise estimates of the treatment effect, with a considerably narrower 95% credible interval. This is because we have “borrowed” information from the previous studies when estimating the treatment effect in the latest study. The estimates are also closer to the group-level mean.
print(eight_schools_plus_regression, pars = c("mu", "theta[9]", "theta_hat"), probs = c(0.025, 0.5, 0.975))
## Inference for Stan model: 8_schools_w_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##            mean se_mean   sd 2.5%   50% 97.5% n_eff Rhat
## mu        10.14    0.02 1.20 7.77 10.15 12.48  4000    1
## theta[9]   9.07    0.03 1.76 5.66  9.10 12.65  4000    1
## theta_hat  8.49    0.05 2.69 3.11  8.55 13.81  2407    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Sep  3 00:36:59 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
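As a quick visual comparison, we can also plot the posterior draws for our study’s treatment effect against the across-study average, using the ggplot2 and reshape2 packages loaded at the top (a minimal sketch of one way to do it):

# Posterior densities for our study's effect (theta[9]) and the
# across-study average effect (theta_hat)
draws <- as.data.frame(as.matrix(eight_schools_plus_regression,
                                 pars = c("theta[9]", "theta_hat")))
draws_long <- melt(draws, variable.name = "parameter", value.name = "draw")
ggplot(draws_long, aes(x = draw, fill = parameter)) +
  geom_density(alpha = 0.4) +
  labs(x = "Treatment effect", y = "Posterior density")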
I hope this helps you implement hierarchical priors in your next research project. If you’ve any questions, or you’ve spotted errors in my reasoning or code, please comment!
