Sunday, November 26, 2017

Bayesian Instrumental variables (with a hierarchical prior) in Stan

The standard implementations of instrumental variables regression, using either linear algebra (2SLS/IV) or optimization (GMM), are well known, and there is no shortage of good resources on their use. Yet IV regression does have a few notable issues, especially in the presence of weak instruments, small samples, or outliers. If you have access to good-quality prior information about effect sizes, then Bayesian instrumental variables regression can help you with these issues. This post walks you through how to implement Bayesian IV in Stan. The particular implementation I'll walk you through uses meta-analytic hierarchical priors: priors for your treatment effect that come from previous studies of the same effect. These are useful when you want both to borrow power from previous research and to evaluate how much your new observations update a reasonable observer's posterior of the effect size given that research.
The statistical model
The standard instrumental variables linear model has, for each observation $i$, an outcome $y_i$. We want to know the effect of a shift in some endogenous treatments $x_i$ on this outcome. We have access to a vector of pre-treatment information $w_i$, and some instruments $z_i$ that exogenously vary $x_i$. There must be as many or more instruments in $z_i$ than there are endogenous variables in $x_i$.
The generative model is

$$x_i = \alpha_1 + \Gamma' w_i + \Delta' z_i + u_i$$

(the first stage) and

$$y_i = \alpha_2 + \beta x_i + \delta' w_i + \epsilon_i$$

(the second stage). $\alpha_1$ and $\alpha_2$ are intercepts, and $\Gamma$, $\Delta$ and $\delta$ are regression coefficients. $\beta$ is what we are interested in.
The important thing to note is that the errors $u_i$ and $\epsilon_i$ are correlated, which is why you can't just estimate the second stage alone. To abuse the language a little, the unobserved things that cause your treatment ($u_i$) are related to the unobserved things that cause your outcome ($\epsilon_i$). But the Gauss-Markov assumptions require strict exogeneity (the errors of a linear regression must be uncorrelated with the covariates), and that requirement is violated here. That's where the bias we're trying to correct comes from.
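To make this concrete, here's a quick simulation sketch: a single endogenous regressor whose first-stage error is correlated with the outcome error. A plain regression of the outcome on the treatment is biased, while two-stage least squares using the instrument recovers the true effect.

# Sketch only: OLS is biased when the treatment and outcome errors are correlated
library(AER)   # for ivreg()
set.seed(1)

N <- 5000
beta_true <- 0.5
z <- rnorm(N)                                                         # instrument
errors <- MASS::mvrnorm(N, c(0, 0), matrix(c(1, 0.7, 0.7, 1), 2, 2))  # correlated u and epsilon
x <- 1 + 0.8 * z + errors[, 1]                                        # first stage
y <- 2 + beta_true * x + errors[, 2]                                  # second stage

coef(lm(y ~ x))["x"]         # biased, well above 0.5
coef(ivreg(y ~ x | z))["x"]  # close to 0.5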
The Bayesian approach to this model is to specify the joint likelihood of the endogenous regressors and the outcome, then put priors on the unknowns.
Let’s call  with  and .
Now we assume that the residuals are either jointly normal or jointly Student-t (which will be more robust to outliers).
Where  is the scale of the residuals and  is their correlation matrix. We like this parameterization (rather than specifying the covariance matrix of the residuals directly) as it’s easier to give informed priors to the scale and correlation of the errors. If we want to use a multivariate Student’s t distribution, we just specify

where  is the degrees of freedom parameter to be estimated; small values of this indicate fat tails.

This gives us the generative model for the observed data, from which we can derive the likelihood:

$$p(Y \mid \alpha, \beta, \Gamma, \Delta, \delta, \tau, \Omega) = \prod_{i=1}^{N} \text{Multi normal}\left(Y_i \mid \mu_i,\ \text{diag}(\tau)\,\Omega\,\text{diag}(\tau)\right)$$
Note that for this to be a truly generative model we ought to be able to actually generate data with it. That means having priors for the unknowns. We typically start with fairly non-informative priors on the scale of the residuals; something like

$$\tau \sim \text{Cauchy}_{+}(0, 1)$$

for data of order-of-magnitude around 1. Then we give a prior to the correlation matrix, normally an LKJ prior with a shape parameter inversely proportional to the degree of endogeneity we think is in the data. If we think there's a fair bit of endogeneity, then a shape parameter around 2 might be reasonable, that is $\Omega \sim \text{LKJ}(2)$.
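As an aside, the scale-and-correlation parameterization relates to the residual covariance matrix as $\Sigma = \text{diag}(\tau)\,\Omega\,\text{diag}(\tau)$. A quick R sketch of that relationship (and of the Cholesky trick the Stan code below uses via diag_pre_multiply) is:

# Sketch: a scale vector plus a correlation matrix gives the covariance matrix
tau   <- c(0.8, 1.5)                        # residual scales
Omega <- matrix(c(1, 0.6, 0.6, 1), 2, 2)    # residual correlation matrix
Sigma <- diag(tau) %*% Omega %*% diag(tau)  # residual covariance matrix

# diag(tau) %*% L_Omega is the lower-triangular Cholesky factor of Sigma,
# which is what diag_pre_multiply(tau, L_Omega) computes in Stan
L_Omega <- t(chol(Omega))
L_Sigma <- diag(tau) %*% L_Omega
all.equal(L_Sigma %*% t(L_Sigma), Sigma)    # TRUE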


For covariates that have been scaled, I'll normally use a prior like $\text{Normal}(0, 1)$ for the regression coefficients $\Gamma$, $\Delta$ and $\delta$, and for the intercepts $\alpha_1$ and $\alpha_2$.

And finally, the prior on $\beta$ – the whole reason we're going through this pain!

A hierarchical prior on $\beta$

Imagine you’re not the first researcher to investigate some causal effect and there exists a literature of other estimates of the same thing. Each of these studies (there are  of these, indexed by ) is trying to measure , but due to study idiosyncrasies, the true effect in each study is . Each study generates a noisy measure of this estimate, , which has its own standard error, . I go over this in a bit more detail here.
The generative model here is

$$\beta_j \sim \text{Normal}(\hat{\beta}, \sigma_\beta)$$

and

$$b_j \sim \text{Normal}(\beta_j, se_j)$$

Now if we want to use this hierarchical prior in our instrumental variables model, all we need to do is say that our $\beta$, let's call it $\beta_{\text{ours}}$, comes from the same parent distribution, that is

$$\beta_{\text{ours}} \sim \text{Normal}(\hat{\beta}, \sigma_\beta)$$
And that is our hierarchical prior.
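If you want a quick, purely classical sense of what this prior implies, you can summarise the previous studies with a standard random-effects meta-analysis. The sketch below (not part of the Bayesian model; it uses the DerSimonian-Laird estimator with made-up inputs) gives ballpark values for $\hat{\beta}$ and $\sigma_\beta$:

# Back-of-envelope check: DerSimonian-Laird random-effects summary of previous studies
meta_summary <- function(b_j, se_j) {
  w        <- 1 / se_j^2
  mu_fixed <- sum(w * b_j) / sum(w)        # precision-weighted (fixed-effect) mean
  Q        <- sum(w * (b_j - mu_fixed)^2)  # heterogeneity statistic
  J        <- length(b_j)
  tau2     <- max(0, (Q - (J - 1)) / (sum(w) - sum(w^2) / sum(w)))
  w_star   <- 1 / (se_j^2 + tau2)          # random-effects weights
  c(beta_hat = sum(w_star * b_j) / sum(w_star), sigma_beta = sqrt(tau2))
}

# Example with made-up estimates and standard errors
meta_summary(b_j = c(0.4, 0.7, 0.5, 0.9), se_j = c(0.2, 0.3, 0.25, 0.4))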

Implementing the model in Stan

Below I’ll give you code for two versions of the model. The first one has a boring prior on  but assumes multiple endogenous regressors. The second assumes a single endogenous regressor and uses the hierarchical prior. You could of course do both, but the sorts of situations I’m thinking about you having good prior information is more when you have a single effect you care about, not many.
Note that in this implementation we provide a flag for whether or not you'd like the model to estimate; if you don't estimate, it'll generate fake data. This is an excellent way to check that the model code is correct. I'll show you how to do that below.
No hierarchical priors, multiple endogenous regressors
data {
  int N;
  int PX; // dimension of exogenous covariates
  int PN; // dimension of endogenous covariates
  int PZ; // dimension of instruments
  matrix[N, PX] X_exog; // exogenous covariates
  matrix[N, PN] X_endog; // endogenous covariates
  matrix[N, PZ] Z; // instruments
  vector[N] Y_outcome; // outcome variable
  int<lower=0,upper=1> run_estimation; // simulate (0) or estimate (1)
}
transformed data {
  matrix[N, 1 + PN] Y;
  Y[,1] = Y_outcome;
  Y[,2:] = X_endog;
}
parameters {
  vector[PX + PN] gamma1; // second stage coefficients (endogenous regressors first, then exogenous)
  matrix[PX + PZ, PN] gamma2; // first stage coefficients (exogenous covariates first, then instruments)
  vector[PN + 1] alpha; // intercepts (outcome first, then endogenous regressors)
  vector<lower = 0>[1 + PN] scale; // residual scales
  cholesky_factor_corr[1 + PN] L_Omega; // Cholesky factor of the residual correlation matrix
}
transformed parameters {
  matrix[N, 1 + PN] mu; // the conditional means of the process

  mu[:,1] = rep_vector(alpha[1], N) + append_col(X_endog,X_exog)*gamma1;
  mu[:,2:] = rep_matrix(alpha[2:]', N) + append_col(X_exog, Z)*gamma2;

}
model {
  // priors
  to_vector(gamma1) ~ normal(0, 1);
  to_vector(gamma2) ~ normal(0, 1);
  to_vector(alpha) ~ normal(0, 1);
  scale ~ cauchy(0, 1);
  L_Omega ~ lkj_corr_cholesky(4);

  // likelihood
  if(run_estimation ==1){
    for(n in 1:N) {
      Y[n] ~ multi_normal_cholesky(mu[n], diag_pre_multiply(scale, L_Omega));
    }
  }
}
generated quantities {
  matrix[N, 1 + PN] Y_simulated;

  for(n in 1:N) {
    Y_simulated[n, 1:(1+PN)] = multi_normal_cholesky_rng(mu[n]', diag_pre_multiply(scale, L_Omega))';
  }
}
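The worked example further down only fits the second model, so here is a rough sketch of how you might prepare a data list for this first one and draw fake data from it (I'm assuming you've saved the block above as classic_iv_multiple_endog.stan):

# Sketch: simulate from the multiple-endogenous-regressor model above
library(rstan)

N <- 500; PX <- 3; PN <- 2; PZ <- 3  # need PZ >= PN: at least as many instruments as endogenous regressors

multi_data <- list(
  N = N, PX = PX, PN = PN, PZ = PZ,
  X_exog    = matrix(rnorm(N * PX), N, PX),
  X_endog   = matrix(rnorm(N * PN), N, PN), # placeholders: the data block needs values for everything
  Z         = matrix(rnorm(N * PZ), N, PZ),
  Y_outcome = rnorm(N),                     # placeholder outcome
  run_estimation = 0                        # draw from the generative model rather than estimate
)

multi_model <- stan_model("classic_iv_multiple_endog.stan")
multi_sims  <- sampling(multi_model, data = multi_data, iter = 50, chains = 1)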
One endogenous regressor, hierarchical prior
// saved as classic_iv_hierarchical_prior.stan
data {
  int N;
  int PX; // dimension of exogenous covariates
  int PZ; // dimension of instruments
  int J; // number of previous studies
  vector[J] b_j; // previous estimates
  vector<lower = 0>[J] se_j; // standard error of previous estimates
  matrix[N, PX] X_exog; // exogenous covariates
  vector[N] X_endog; // endogenous covariate
  matrix[N, PZ] Z; // instruments
  vector[N] Y_outcome; // outcome variable
  int<lower=0,upper=1> run_estimation; // simulate (0) or estimate (1)
}
transformed data {
  matrix[N, 2] Y;
  Y[,1] = X_endog;
  Y[,2] = Y_outcome;
}
parameters {
  // intercepts
  vector[2] alpha;
  
  // hierarchical prior parameters
  vector[J] z;
  real beta_hat;
  real<lower = 0> sigma_beta;
  
  // first stage coefficients
  vector[PZ] Delta; // coefficients on the instruments
  vector[PX] Gamma; // coefficients on the exogenous covariates
  
  // second stage coefficients
  vector[PX] delta;
  real beta_ours;
  
  // covariance matrix parameters
  cholesky_factor_corr[2] L_Omega;
  vector<lower = 0>[2] tau;
}
transformed parameters {
  matrix[N, 2] mu; // the conditional means of the process
  vector[J] beta_j;
  
  // first stage
  mu[:,1] = rep_vector(alpha[1], N) + append_col(X_exog, Z)*append_row(Gamma, Delta);
  // second stage
  mu[:,2] = rep_vector(alpha[2], N) + append_col(X_endog,X_exog) * append_row(beta_ours, delta);
  
  // this is the non-centered parameterization: if beta_j ~ normal(beta_hat, sigma_beta), then
  // beta_j = beta_hat + sigma_beta * z with z ~ normal(0, 1)
  
  for(j in 1:J) {
    beta_j[j] = beta_hat + sigma_beta * z[j];
  }
}
model {
  // intercept prior
  alpha ~ normal(0, 1);
  
  // first stage priors
  Delta ~ normal(0, 1);
  Gamma ~ normal(0, 1);
  
  // second stage priors
  delta ~ normal(0, 1);
  
  // hierarchical prior 
  
  z ~ normal(0, 1);
  
  for(j in 1:J) {
    b_j[j] ~ normal(beta_j[j], se_j[j]);
  }
  
  beta_ours ~ normal(beta_hat, sigma_beta);
  
  // Covariance matrix prior
  tau ~ cauchy(0, 2);
  L_Omega ~ lkj_corr_cholesky(2);

  // likelihood
  if(run_estimation ==1){
    for(n in 1:N) {
      Y[n] ~ multi_normal_cholesky(mu[n], diag_pre_multiply(tau, L_Omega));
    }
  }
}
generated quantities {
  vector[N] y_sim;
  vector[N] x_endog;
  
  for(n in 1:N) {
    vector[2] error;
    error = multi_normal_cholesky_rng(rep_vector(0.0, 2), diag_pre_multiply(tau, L_Omega));
    
    x_endog[n] = mu[n, 1] + error[1];
    y_sim[n] = alpha[2] + x_endog[n] * beta_ours + X_exog[n]* delta + error[2];
  }
}
Fake data example
Now I’ve shown you how to write out the model in Stan, let’s use it to simulate some fake data given some instruments and covariates, then estimate the model both using classic IV and Bayesian IV.
First we’re going to make some fake data to feed to the model.
set.seed(42)
# Load the stan library
library(rstan)
options(mc.cores = parallel::detectCores())

# Compile the model
compiled_model <- stan_model("classic_iv_hierarchical_prior.stan")
# Let's make some fake data

N <- 1000 # Number of observations
PX <- 5 # Number of exogenous variables
PZ <- 2 # Number of instruments
J <- 10 # Number of previous studies

# Previous study parameters, drawn from the prior
beta_hat <- rnorm(1, 0, 1)
sigma_beta <- truncnorm::rtruncnorm(1, a = 0)
beta_j <- rnorm(J, beta_hat, sigma_beta)
se_j <- truncnorm::rtruncnorm(J, a = 0)
b_j <- rnorm(J, beta_j, se_j)

# Exogenous variables (make them correlated with a random correlation matrix)
X_exog <- MASS::mvrnorm(N, rep(0, PX), cor(matrix(rnorm(PX*(PX+5)), PX+5, PX)))

# We have to feed it some dummy endogenous variables but these won't make a difference
X_endog <- rnorm(N)

# Some fake instruments
Z <- MASS::mvrnorm(N, rep(0, PZ), cor(matrix(rnorm(PZ*(PZ+5)), PZ+5, PZ)))

Y_outcome <- rnorm(N) # dummy outcome: not used when run_estimation = 0

data_list <- list(N = N, PX = PX, PZ = PZ, J = J,
                  b_j = b_j, se_j = se_j, X_exog = X_exog, X_endog = X_endog, 
                  Z = Z, Y_outcome = Y_outcome, 
                  run_estimation = 0)
And get some fake data draws from the model.
# Now let's get some fake draws
draws_from_model <- sampling(compiled_model, data = data_list, iter = 50, chains = 1)
What we’ve done above is to first compiled the Stan code into machine code. Then we made some fake data for the “meta-analysis” we have conducted. Then we made fake values for  and  (we had to make fake values for  and the endogenous regressor also because Stan needs us to give it a data list with values for everything declared in the data block). Finally we asked Stan to “sample” from the model. Because we weren’t evaluating the likelihood, this is equivalent to drawing from the prior, then drawing from the generative model, for 50 iterations.
What we’ll do now is get the fake data out of the fitted Stan object and run Stan on the fake data.
# Let's use the next to last non-warm-up draw as our fake data (draw 24)
y_sim <- extract(draws_from_model, pars = "y_sim")[[1]][24,]
x_endog <- extract(draws_from_model, pars = "x_endog")[[1]][24,]
true_beta <- extract(draws_from_model, pars = "beta_ours")[[1]][24]

# Now let's make a new data list
data_list_2 <- data_list
data_list_2$X_endog <- x_endog
data_list_2$Y_outcome <- y_sim
data_list_2$run_estimation <- 1
What we’ve done is extracted the fake data from the “model fit” and replaced the old entries in the data list with the “real” values of the endogenous regressor and outcome. We’ve also switched run_estimation on. So this time she’ll estimate.
# Fit the model
fitted_model <- sampling(compiled_model, data = data_list_2, cores = 4, chains = 4, iter = 1000)
Great! That estimated fairly cleanly in about three minutes on my laptop. Let’s check the estimates.
print(fitted_model, pars = c("beta_ours"))
## Inference for Stan model: classic_iv_hierarchical_prior.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##           mean se_mean   sd 2.5%  25%  50% 75% 97.5% n_eff Rhat
## beta_ours 1.65       0 0.08 1.49 1.59 1.65 1.7  1.81   582 1.01
## 
## Samples were drawn using NUTS(diag_e) at Sun Nov 26 22:56:15 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
print(paste0("And the true value of beta_ours is ", round(true_beta, 3)))
## [1] "And the true value of beta_ours is 1.703"
So the true value falls within the 95% interval of our estimate. Now let's stick the data together and estimate the same model using vanilla IV.
library(AER)

iv_data <- data.frame(Y_outcome = y_sim, X_endog = x_endog, X_exog, Z)

iv_fit <- ivreg(Y_outcome ~ X_endog + X1+X2+X3+X4+X5| X1.1 + X2.1+ X1+X2+X3+X4+X5, data = iv_data)
# That fit instantly! 

# ...but
summary(iv_fit)
## 
## Call:
## ivreg(formula = Y_outcome ~ X_endog + X1 + X2 + X3 + X4 + X5 | 
##     X1.1 + X2.1 + X1 + X2 + X3 + X4 + X5, data = iv_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -10.47353  -2.18462  -0.05983   2.11454  10.12866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.40662    0.65526  -2.147   0.0321 *  
## X_endog      1.95165    0.45899   4.252 2.32e-05 ***
## X1           0.06253    0.12955   0.483   0.6294    
## X2          -0.08761    0.22900  -0.383   0.7021    
## X3          -0.03708    0.25834  -0.144   0.8859    
## X4           0.11932    0.20094   0.594   0.5528    
## X5           0.19293    0.18676   1.033   0.3018    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.122 on 993 degrees of freedom
## Multiple R-Squared: 0.8395,  Adjusted R-squared: 0.8385 
## Wald test: 7.656 on 6 and 993 DF,  p-value: 4.691e-08
This estimate is also close, but because it doesn't make use of the hierarchical prior it is less precise. Let's plot the estimates against the true value (using draws from the pseudo-posterior for IV).
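A minimal ggplot2 sketch of that comparison (treating the classic IV point estimate and its standard error as a normal pseudo-posterior) might look like this:

library(ggplot2)

# Posterior draws for beta from the Bayesian model
bayes_draws <- as.vector(extract(fitted_model, pars = "beta_ours")[[1]])

# "Pseudo-posterior" for classic IV: normal draws centred on the point estimate
iv_est   <- summary(iv_fit)$coefficients["X_endog", ]
iv_draws <- rnorm(length(bayes_draws), mean = iv_est["Estimate"], sd = iv_est["Std. Error"])

plot_data <- data.frame(
  estimate = c(bayes_draws, iv_draws),
  model    = rep(c("Bayesian IV", "Classic IV"), each = length(bayes_draws))
)

ggplot(plot_data, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = true_beta, linetype = "dashed") +
  labs(x = "Estimate of the treatment effect", y = "Density", fill = NULL)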
And hey presto! We've fit the fake data with Bayes and with vanilla IV. In this example, you can see that both the vanilla IV and the Bayesian estimates get pretty close to the true effect. Yet the Bayesian estimate is far more precise, because it uses the information in the hierarchical prior.

Comments:

  1. Hi Jim,

    Many thanks for this post. I am working on a case that matches the relatively simple IV case you cover. The context is alcohol damage cost and alcohol taxes at the state level, for the US, and I think it is an interesting public policy question, where it would be good to test some causal claims.

    I have one endogenous regressor, and using traditional metrics some plausible instrument options. I have been able to implement a range of classical models for this type of problem, but I would also be interested in exploring an implementation of the Bayesian model you describe.

    I am reasonably competent with classical statistics and R, and at the moment I actually have a pretty much complete paper using both a standard IV approach and an IV-quantile regression approach. I am however only just starting to come to terms with Bayesian options, and with getting Stan to run. And while I do have Stan up and running, I fear I could do something quite dumb if I were to just run with the code you have provided without a really good understanding of the overall Bayesian approach. Essentially I don't feel confident I would be able to do a good job of the "evaluate and criticize the model" step.

    As such I was just wondering if you were at all interested in collaborating on this research topic. Essentially I think I have done some pretty solid work, and have been able to pull a range of complementary evidence together to support the current results in the paper, but I would like to augment my classical analysis with your Bayesian IV version.

    If a collaboration of this type is at all something of interest please let me know: James.Fogarty(at)uwa.edu.au

    I am based at the University of Western Australia. If you had any interest I could share a copy of the current version of the paper, which would give you a chance to see the quality of my work, and maybe make an assessment of whether I fit more into the category of crackpot, or of someone who is reasonably OK with classical econometrics and economics and is just looking to create some partnerships with people on the Bayesian path. To encourage collaborations my employer does not weight publications by number of authors, so whether there is one author on a paper or five I get the same credit.

    My other option seems to be to wait for Bayesian Econometrics with Stan, 1st Edition, to come out, which I hope is really not that far away.

    Regards,
    James Fogarty
    Agricultural and Resource Economics
    University of Western Australia

  2. Thank you for putting together this great piece of tutorial. I am not very familiar with the mathematical notation used. What does the "'" in `mu[:,2:] = rep_matrix(alpha[2:]', N) + append_col(X_exog, Z)*gamma2;` mean?

    It seems to me that it may mean the transpose of the vector?

    What do I need to do if I want to learn more about the Stan syntax, like rep_matrix and append_col?

    How do you usually debug these transformations of data?

    Thanks!
