Sunday, October 15, 2017

Your model gives you a lot of information about what your prior should(n’t) be

Earlier this week, an interesting young researcher (names withheld to protect the innocent) came by to get some advice on one of their models. They were implementing Mr P with a model of individual ’s decision to vote for some candidate (indicated by ) with a sum-of-random-effects model. The idea with these models is that you have a bunch of demographic buckets, so that your data look like this:
IndividualVoted for candidateAge bucketEducation
11[18, 25)< 9th grade
3065+High school
40[18, 25)Masters
Of course we can convert these strings into indexes like so.
IndividualVoted for candidateAge bucketEducation
And then estimate a model like
Where each of the s (aside from p=0) are random effects centered around 0 and have their own scale, which we estimate.

The problem with this particular researcher’s implementation was that they had been encouraged to use extremely wide priors for the s, like  with .
What does that mean in terms of what the researcher considers to be probable? Well, it means for instance that they think that the scale of the random effects would have roughly a 1/3rd probability of being greater than 100. That might be OK in a linear regression, but for logit regression it makes no sense. For instance, it indicates that the researcher believes there’s about a 98% probability that a given demographic cell would vote for the candidate in more than 95% of cases! Remember—with logit, 4 is basically infinity.
A second example
Another similar example I see a bit of is in time-series modeling, where researchers frequently write out some model, for instance a toy AR(1) like
and then give a prior to  like

If  is greater than 1, then the time series  is explosive. The prior given above assumes that up-front, around 92% of the density is for values with an absolute value greater than 1.

Again: the researcher is saying that they give 92% probability to the series being explosive.
A more difficult example
A close friend has constructed a structural model that takes a set of behavioural parameters for each individual, , and maps these to the “utility” coming from contract, which we index with . If  is a vector of characteristics of the contract (for example, the amounts that need to be paid under various scenarios), then each contract is worth  utils to each individual, where  is iid Gumbel distributed. The aim is to estimate the joint distribution of  across individuals.
The nice thing about using Gumbel errors in a framework like this is that it makes the likelihood easy to construct. If  is the vector of utilities from each choice, then

The problem she encountered is that  contains large numbers, with payments in the thousands of dollars. Even with priors that might “look” fairly tight, like  can be an enormous number, and imply choice probabilities very close to 1 or 0. If we observe someone making a choice assigned a probability close to 0, the gradient of the log likelihood becomes very difficult to evaluate, and your model will have a tough time converging.

So, what do we do?

There are a couple of approaches to helping to resolve these problems in applied modeling:
Learn about how your prior affects your model by simulation
Perhaps the most straightforward way to do this is to simulate first from your prior, then from your data generating process. There are easy ways to do this from within Stan.
What you will quickly realize is that some priors imply ridiculous things for your data. Unless you want to assign a probability that these ridiculous events are possible, think more carefully about these priors. In almost every case, a  prior is going to be doing you damage. But often a  prior will similarly imply weird things for your data.
Often it’s easier to express a prior on a space we can understand and map it to a constrained space
A few of the examples above are easily remedied by just scaling in the width of the priors. In other models —particularly time-series models—it’s not so easy. For example, what if we had an AR(2) model in example 2? A “stable” prior would be a (probably constrained) prior on the roots of the characteristic polynomial, which wouldn’t necessarily have an invertable relationship with the model parameters. This is difficult stuff!
One fairly commonly-used approach is to use some accept-reject method where we throw out combinations of parameters that do not imply stationarity, like in the famous Cogley Sargent (2001) paper. That method is fine from a theoretical viewpoint, but it does not scale to higher dimensions. Another, far better but more difficult approach is to define priors on a space that we understand—for example the partial autocorrelations—so that we have a mapping between those and the model parameters we care about. That’s the approach discussed by Zhang and McLeod here. This mapping needn’t be 1-1. A great example is Ben Goodrich’s prior for linear regression, which allows the researcher to express a prior over the  of the regression.
The prior is information that is not in your data
The reason I got into Bayesian statistica in the first place is because I model portfolios of debt in Sub-Saharan Africa. By definition, the clients who come to Lendable to raise financing have yet to experience a “killer” portfolio shock. That is, the shocks that we have observed in their portfolios are certainly smaller than the potential shocks out there. We know this—our clients are still in business! Yet when we securitize their book, we want to generate uncertainty bounds that take into account the size of future potential shocks. The scale of these shocks, of course, will be under-estimated by the data. That’s where informative priors—research!—comes in.
Even when you’re in the world of generating good priors through deep research, you still need to think about how they work within your model! This essay by Gelman, Simpson and Betancourt summarizes the current state of affairs at the bleeding edge.
Doing applied work is hard, and choosing priors is important. But it’s an important part of your job, and frankly a small part of me dies every time I see someone write theta ~ normal(0, 100).


  1. It is not fine from a theoretical viewpoint to throw out combinations of parameters that do not imply stationarity in Stan. For an AR(2) model ( ), if you put

    if (varphi2 < -1 || phi2 >= 1 - fabs(varphi1)) reject("explosive");

    in your model block, then priors like

    varphi1 ~ normal(0,1); varphi2 ~ normal(0,1);

    do not integrate to 1 over the non-explosive region of the parameter space. So, the resulting posterior distribution is not right. Rejecting draws is only valid if your truncate your priors appropriately or if varphi1 and varphi2 have improper uniform priors, and even then Stan probably won't be able to sample from the posterior distribution efficiently enough for you to trust the posterior mean estimates.

    1. Ben -- is it right to say that in Gibbs sampling it doesn't matter? That is, say your prior was p(theta) \propto I(theta) N(mu, Sigma), where I() is 0 when theta is explosive, then p(theta | y) \propto I(theta) N(mu, Sigma) p(y | theta). That seems fine to me, but I'm probably missing something.

      Also I've never totally grokked why it's OK with improper uniform priors. I've heard people say that it implies a Haar prior over the coefficients. Can you point me to anything?

    2. The other interesting criticism of that approach (from Sims' comment to the original paper) is that for many time series we probably want a lot of mass on the unit root, so that we don't interpret historical shocks as being transitory.