Sunday, April 30, 2017

An easy way to simulate fake data from your Stan model

The core of Modern Statistical Workflow is to always simulate fake data from your generative model before even touching real data. Whenever a friend asks for help with fitting a model in Stan, my first question is: “have you estimated the model on fake data simulated from the generative process you are proposing?” More often than not, the answer is “no”, and in the process of simulating the fake data they catch their own errors.
When simulating this fake data, our parameters should be drawn from the prior. Doing so tells us when our priors are stupid. Whenever you’re having to try to defend yourself against someone who’s telling you to use some crazily diffuse prior like  tell them to go through this exercise (we do this below). There’s also sense in drawing your parameters from the prior and then estimating your model on the resulting fake data: when your prior places little weight on the true value of a parameter, your inference (and sampling) will be compromised.
So how can we make this process easier?
The problem with this approach is that typically we have to take care of two bits of code: our simulation model and the estimation model. If we make a change in one, that change needs to be reflected in the other. That might not be an issue when your model is simple. But if you’re in the world of fitting models in Stan, then you’re trying to fit complex models. And making changes to two sets of code for a big, complex model is a royal PITA.

Making use of Stan’s generated quantities block to simulate fake data

The really simple fix to this is to handle all modeling, both simulation and estimation, in the one piece of code. We do this by only evaluating the likelihood if we tell it to. If a likelihood is not evaluated, then our posterior is equal to the prior. In this case, our posterior predictive draws are actually prior predictive draws—precisely the fake data we want.
As a very simple example, let’s put together a tiny classification model. Say that our outcome  is distributed according to a categorical distribution with probabilities vector 

To identify the model, we’ll set the first element of  to 0. Over the rest of the parameters we’ll give a very diffuse prior, say, .
We code this up in Stan as:
// saved as categorical_model.stan
data {
  int N;
  int P; // number of categories to be estimated
  int y[N]; // outcomes
  int<lower = 0, upper = 1> run_estimation; // a switch to evaluate the likelihood
  real<lower = 0> prior_sd; // standard deviation of the prior on theta
}
parameters {
  vector[P-1] theta_raw;
}
transformed parameters {
  vector[P] theta;
  theta[1] = 0.0;
  theta[2:P] = theta_raw;
}
model {
  // prior
  theta_raw ~ normal(0, prior_sd);
  
  // likelihood, which we only evaluate conditionally
  if(run_estimation==1){
    y ~ categorical(softmax(theta));
  }
}
generated quantities {
  vector[N] y_sim;
  for(i in 1:N) {
    y_sim[i] = categorical_rng(softmax(theta));
  }
}
Now let’s compile and estimate this model with run_estimation=0, that is, in simulation mode. We’ll grab a random draw from the posterior, which will contain some fixed parameters, as well as the draws of fake data that we want.
compiled_model <- stan_model("categorical_model.stan")

sim_out <- sampling(compiled_model, data = list(N = 1000, 
                                                P = 5, 
                                                # Y should be real but fake data if we're simulating
                                                # new Y
                                                y = sample(1:2, 1000, replace = T), 
                                                run_estimation = 0,
                                                prior_sd = 100))

library(dplyr)

fake_data_matrix  <- sim_out %>% 
  as.data.frame %>% 
  select(contains("y_sim"))

summary_tbl <- apply(fake_data_matrix[1:5,], 1, summary)
Now we’ve got some fake data draws. Lets look at the distribution of outcomes for a very diffuse prior. Remember, each row of fake_data_matrix is associated with the corresponding prior draw.
12345
Min.22.000324
1st Qu.25.000324
Median25.000324
Mean24.985324
3rd Qu.25.000324
Max.25.000324
The rows of the table are the relevant summary statistics of the simulated outcome for each draw. The columns correspond to the first five draws. As is clear, in each draw almost all the outcomes are precisely the same. This is because we’ve chosen a silly diffuse prior. In many cases we don't really expect the parameters to take on values that make our outcome all but deterministic. (If this doesn’t make sense, check the definition of the softmax function and imagine what happens if you draw one  with a value of 400 or something). Let’s try it again with a more sensible prior.
sim_out_narrowprior <- sampling(compiled_model, data = list(N = 1000, 
                                                P = 5, 
                                                # Y should be real but fake data if we're simulating
                                                # new Y
                                                y = sample(1:2, 1000, replace = T), 
                                                run_estimation = 0,
                                                prior_sd = 1))


fake_data_matrix_2  <- sim_out_narrowprior %>% 
  as.data.frame %>% 
  select(contains("y_sim"))

summary_tbl_2 <- apply(fake_data_matrix_2[1:5,], 1, summary)
12345
Min.1.0001.0001.0001.0001.000
1st Qu.2.0003.0002.0002.0003.000
Median3.0003.0002.0002.0003.000
Mean2.8073.1862.3763.0143.191
3rd Qu.4.0004.0002.0004.0004.000
Max.5.0005.0005.0005.0005.000
Much better! With a narrower prior we’re actually getting a variation in the outcome. Let’s grab our thetas and fake data from that run for a given draw. Say, 21. Then we’ll run the model again to make sure we can recapture those known thetas.
draw <- 21
y_sim <- extract(sim_out_narrowprior, pars = "y_sim")[[1]][draw,]
true_thetas <- extract(sim_out_narrowprior, pars = "theta_raw")[[1]][draw,]

# Now estimate the model with this "true" data, to see that we can estimate the known parameters

dgp_recapture_fit <- sampling(compiled_model, data = list(N = 1000, 
                                                P = 5, 
                                                # Note we now use y_sim
                                                y = y_sim, 
                                                # Note we not switch on run_estimation
                                                run_estimation = 1,
                                                prior_sd = 1))
How did we go?
Great! We just simulated some fake data from a Stan model, used it to discover that our original priors were insane, simulated again, and used the fake data to validate our model. Believe me—if you build models for a living, getting into the habit of doing this will save a lot of heartache.
But it gets even better. When the likelihood is switched on, then our simulated data are our posterior predictive simulations, which are useful for model checking. But that's another post. 

1 comment:

  1. Hi Jim, thanks for the post! I was just trying to figure this out myself.

    Some questions:
    1. if you have `P=5` in your categorical model you are expecting five different levels, right? Then it seems inappropriate to use `y = sample(1:2, 1000, replace = T)` because y will have only 2 unique values. It doesn't really matter in your case because it is actually never used when `run_estimation == 0`, but still..
    2. with a narrow prior, it seems your model can recover the parameters exactly because the true parameters and the data are generated using the same model, and you already know where they will be at. With real data and many predictors, the parameter values tend to be more diffusive, then you can't possibly know where the parameter values will be before you fit the model. In this case, using a narrow prior would bias the estimation, or? And if you can't use narrow priors, again the simulated data can be quite different than the real data..

    I ask this because I actually have a certain model in mind. I'm trying to fit a ordered categorical model, and when simulating data with given parameters (cutpoints and linear model predictors), slight change in parameter values can lead to quite different simulated data, so I'm not sure if your approach is feasible in this scenario.

    ReplyDelete