Sunday, November 13, 2016

A ridiculously simple model of heterogeneous treatment effects

Here are a few common problems.
  1. We have a limited amount of disaster relief money that we wish to distribute to regions that are most likely to benefit from it.
  2. A political candidate wants to target their advertising towards those who are most likely to vote as a consequence.
  3. An e-commerce firm wants to offer a product at the highest price a customer is willing to pay.
Problems like these require using models that allow some individuals to have different treatment effects than other individuals. These heterogeneous treatment effect models are all the rage in causal inference nowadays, and there are a boatload of methods currently in use. Some good examples include Wager and Athey’s causal forest idea, and Bayesian Additive Regression Trees (BART), used most famously by Green and Kern and Jennifer Hill.
To my mind there are two issues with these machine-learning-based methods. First, they need a large number of observations to give stable estimates. That might not be an issue if you are a large e-commerce firm that can run enormous A/B tests, but if you are in health research or aid (where you measure the cost of each observation in thousands or millions of dollars), it might be prohibitive. Second, in an off-the-shelf application, there’s no way of using instrumental variables to help ameliorate the effects of unobserved confounders. This means they’re pretty useless in quasi-experimental settings.
Below I illustrate a much simpler heterogeneous treatment effects model for rank ordering units in terms of their likely treatment effects.
The idea is simple. There are three candidate models: one with a negative treatment effect, one with a treatment effect of zero, and one with a positive treatment effect. Our estimate of each individual’s treatment effect is the weighted average of the negative, zero and positive treatment effects, with the weights a function of individual characteristics. This has far fewer parameters than most heterogeneous treatment effects models, making it suitable for small data. It can also be easily extended to make use of instruments. It is similar in concept to the finite mixture models discussed here and here.
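As a toy illustration of that weighted-average idea (the numbers below are made up purely for illustration and are not part of the model fit later):
# Toy illustration only: made-up numbers, not estimates from any model.
# Three candidate treatment effects: negative, zero, and positive.
tau_candidates <- c(-2, 0, 1.5)

# An individual's characteristics map (linearly) onto unnormalised scores;
# a softmax turns those scores into mixture weights that sum to one.
softmax <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))
scores <- c(0.2, -0.5, 1.1) # e.g. a linear function of the individual's covariates
weights <- softmax(scores)

# The individual's expected treatment effect is the weighted average
sum(weights * tau_candidates)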

The generative model

Each individual $i$ has outcome $Y_{i}$ and a vector of pre-treatment characteristics $X_{i}$. Each individual has their own treatment effect $\tau_{i}$, which is the expected difference in outcomes between the treated and untreated state for a binary treatment $\text{treatment}_{i} \in \{0, 1\}$. Their individual treatment effect is a function of $X_{i}$ and an idiosyncratic component $\eta_{i}$.
For simplicity, we let all functions be linear. So the model of treatment effects is

$$ \tau_{i} = \gamma_{0} + X_{i}\gamma + \eta_{i} $$

And the outcome model is

$$ Y_{i} = \beta_{0} + X_{i}\beta + \tau_{i}\,\text{treatment}_{i} + \epsilon_{i} $$

where $\beta_{0}$ and $\gamma_{0}$ are intercept terms, $\beta$ and $\gamma$ are regression coefficients, and $\epsilon_{i}$ and $\eta_{i}$ are idiosyncratic errors that are mean zero and independently normal with scales $\sigma_{\epsilon}$ and $\sigma_{\eta}$.
Let’s simulate some fake data from this process (with known parameters)
# Data - number of observations, number of covariates, a treatment vector, a 
# covariate matrix, and a covariate matrix with a column of 1s
N <- 500
P <- 3
treatment <- sample(0:1, N, replace = T)
X <- matrix(rnorm(N*P), N, P)
X2 <- cbind(rep(1, N),X)

# Parameters (draw from prior) 
beta <- rnorm(P+1)
gamma <- rnorm(P+1, 0, .5)
sigma <- c(.5, 1)

tau <- X2%*%gamma + rnorm(N, 0, sigma[1])
Y <- X2%*%beta + tau*treatment + rnorm(N, 0, sigma[2])

Estimating the model

In theory, we could estimate the model directly by just writing out the model as above. I have done so here, but doing so is troublesome. First, the model is very loosely identified, and even with a large number of iterations we still get Rhats that are quite high (indicating that convergence has been lousy) and very low effective sample sizes. The identification really falls apart if we let ϵ and η be correlated. So estimating this model well means using better priors, which we might not have. Second, the number of parameters grows with the number of observations. HMC run-time grows at a low polynomial of the number of parameters, so this doesn’t really scale to big datasets.
The approach I advocate for here is much simpler. Rather than try to estimate a separate treatment effect for each individual, we say there are three possible data generating processes, each with probability $\theta_{i} = (\theta_{i,1}, \theta_{i,2}, \theta_{i,3})$ for individual $i$:

$$ Y_{i} = \begin{cases} \alpha + X_{i}\beta + \tau_{1}\,\text{treatment}_{i} + \epsilon_{i} & \text{with probability } \theta_{i,1} \\ \alpha + X_{i}\beta + \epsilon_{i} & \text{with probability } \theta_{i,2} \\ \alpha + X_{i}\beta + \tau_{3}\,\text{treatment}_{i} + \epsilon_{i} & \text{with probability } \theta_{i,3} \end{cases} $$

Here, $\tau_{1}$ is strictly negative and $\tau_{3}$ is strictly positive. $\theta_{i}$ is then defined as

$$ \theta_{i} = \text{softmax}(\gamma \tilde{X}_{i}) $$

where $\tilde{X}_{i} = (1, X_{i})'$ and $\gamma$ is a $3 \times (P+1)$ matrix that maps $\tilde{X}_{i}$ onto the probability of each data generating process. We set the last row of $\gamma$ to zero to identify the model. The estimate of each individual’s treatment effect is then simply $\theta_{i}\cdot\tau$, where $\tau = (\tau_{1}, 0, \tau_{3})'$. The Stan code for this model is as follows:
// saved as mixture_treatment.stan
data {
  int N; // observations
  int P; // covariates
  vector<lower = 0, upper = 1>[N] treatment; // binary treatment
  matrix[N, P] X; // covariates
  vector[N] Y;
}
transformed data {
  matrix[N, P+1] X2;
  X2 = append_col(rep_vector(1.0, N), X);
}
parameters {
  real alpha; // intercept
  vector[P] beta; // regression coefficients
  real<upper = 0> tau_1; // negative treatment effect
  real<lower = 0> tau_3; // positive treatment effect
  matrix[2, P+1] gamma_raw; // treatment effect model 
  real<lower = 0> sigma;
}
transformed parameters {
  matrix[N, 3] theta;
  {
  matrix[3, P+1] gamma; // treatment effect model parameters (zero centered)
  gamma = append_row(gamma_raw, rep_row_vector(0.0, P+1));  
  for(n in 1:N) {
    theta[n] = softmax(gamma * X2[n]')'; // transpose: softmax returns a vector, theta[n] is a row_vector
  }
  }
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  tau_1 ~ normal(0, 2);
  tau_3 ~ normal(0, 2);
  to_vector(gamma_raw) ~ normal(0, 1);
  sigma ~ cauchy(0, 1);
  
  for(n in 1:N) {
    vector[3] temp;
    
    temp[1] = log(theta[n,1]) + normal_lpdf(Y[n] | alpha + X[n]*beta + tau_1*treatment[n], sigma);
    temp[2] = log(theta[n,2]) + normal_lpdf(Y[n] | alpha + X[n]*beta, sigma);
    temp[3] = log(theta[n,3]) + normal_lpdf(Y[n] | alpha + X[n]*beta + tau_3*treatment[n], sigma);
    
    target += log_sum_exp(temp);
  }
}
generated quantities {
  vector[N] treatment_effect;
  vector[3] tau;
  tau[1] = tau_1;
  tau[2] = 0.0;
  tau[3] = tau_3;
  
  for(n in 1:N) {
    treatment_effect[n] = theta[n]*tau;
  }
}
We can estimate this model (which estimates very cleanly) with the following code:
library(dplyr); library(reshape2); library(ggplot2); library(rstan)
options(mc.cores = parallel::detectCores())


compiled_model_2 <- stan_model("mixture_treatment.stan")
estimated_model <- sampling(compiled_model_2, 
                            data = list(N = N,
                                        P = P, 
                                        X = X, 
                                        Y = as.numeric(Y),
                                        treatment = treatment))
We can then compare the known individual treatment effects to their estimates (the code for this chart is a bit involved; if you want to replicate it, please check the git version of this post).
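For a rough idea of that comparison, here is a minimal sketch (not the chart code from the repository), assuming the simulated tau and the fitted estimated_model from above are still in the workspace:
# Posterior draws of each individual's treatment effect (iterations x N)
treatment_effects <- as.matrix(estimated_model, pars = "treatment_effect")

# Compare the true simulated effects with their posterior means
comparison <- data.frame(true = as.numeric(tau),
                         estimated = colMeans(treatment_effects))

ggplot(comparison, aes(x = true, y = estimated)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "True (simulated) treatment effect", y = "Posterior mean estimate")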
As we can see, the model seems to capture the vast majority of the variation in individual treatment effects without modeling them explicitly. It does not do a perfect job. But because it has very few parameters, it will scale nicely to large datasets. We can also incorporate instruments for the treatment just as we would with regular Bayesian instrumental variables.
Now it’s not the perfect method. The main problem is that it has a hard maximum ($\tau_{3}$) and minimum ($\tau_{1}$) on the treatment effects. If the distribution of true treatment effects extends beyond these bounds, it will shrink the estimates of individual treatment effects towards 0 too aggressively. This might be remedied by having more than three candidate treatment effects. Another problem is that the softmax function always puts positive weight on the possibility of a positive, a negative and a zero treatment effect. Both of these problems come from the same cause: we’re approximating a linear function with a sigmoid. In simulations in which the treatment effects were all positive or all negative, the predictions were more extreme than the actuals. Cognizant of the above, I’d not use this model to estimate average treatment effects.
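To see those bounds concretely (again with made-up numbers): because the weights are strictly positive and sum to one, the implied effect can never fall below $\tau_{1}$ or rise above $\tau_{3}$, no matter how extreme the covariates.
# Made-up numbers: even with almost all weight on the positive component,
# the implied effect stays strictly inside (tau_1, tau_3)
tau_1 <- -1; tau_3 <- 2
weights_extreme <- c(0.001, 0.001, 0.998)
sum(weights_extreme * c(tau_1, 0, tau_3)) # 1.995, just under tau_3 = 2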
What the model does do very well is rank-order individuals in terms of their likely treatment effects. And for many applications, that is what matters.
