#
*Jim Savage*

Earlier this week, an interesting young researcher (names withheld to protect the innocent) came by to get some advice on one of their models. They were implementing Mr P with a model of individual ’s decision to vote for some candidate (indicated by ) with a sum-of-random-effects model. The idea with these models is that you have a bunch of demographic buckets, so that your data look like this:

Individual | Voted for candidate | Age bucket | Education | … |
---|---|---|---|---|

1 | 1 | [18, 25) | < 9th grade | … |

2 | 0 | [25,34) | Masters | … |

3 | 0 | 65+ | High school | … |

4 | 0 | [18, 25) | Masters | … |

5 | 1 | [45,54) | Bachelors | … |

Of course we can convert these strings into indexes like so.

Individual | Voted for candidate | Age bucket | Education | … |
---|---|---|---|---|

1 | 1 | 1 | 1 | … |

2 | 0 | 2 | 4 | … |

3 | 0 | 4 | 3 | … |

4 | 0 | 1 | 4 | … |

5 | 1 | 3 | 2 | … |

And then estimate a model like

The problem with this particular researcher’s implementation was that they had been encouraged to use extremely wide priors for the s, like with .

What does that mean in terms of what the researcher considers to be probable? Well, it means for instance that they think that the scale of the random effects would have roughly a 1/3rd probability of being greater than 100. That

*might*be OK in a linear regression, but for logit regression it makes no sense. For instance, it indicates that the researcher believes there’s about a 98% probability that a given demographic cell would vote for the candidate in more than 95% of cases! Remember—with logit, 4 is basically infinity.
A second example

Another similar example I see a bit of is in time-series modeling, where researchers frequently write out some model, for instance a toy AR(1) like

Again: the researcher is saying that they give 92% probability to the series being explosive.

A more difficult example

A close friend has constructed a structural model that takes a set of behavioural parameters for each individual, , and maps these to the “utility” coming from contract, which we index with . If is a vector of characteristics of the contract (for example, the amounts that need to be paid under various scenarios), then each contract is worth utils to each individual, where is iid Gumbel distributed. The aim is to estimate the joint distribution of across individuals.

The nice thing about using Gumbel errors in a framework like this is that it makes the likelihood easy to construct. If is the vector of utilities from each choice, then

The problem she encountered is that contains large numbers, with payments in the thousands of dollars. Even with priors that might “look” fairly tight, like , can be an enormous number, and imply choice probabilities very close to 1 or 0. If we observe someone making a choice assigned a probability close to 0, the gradient of the log likelihood becomes very difficult to evaluate, and your model will have a tough time converging.

### So, what do we do?

There are a couple of approaches to helping to resolve these problems in applied modeling:

Learn about how your prior affects your model by simulation

Perhaps the most straightforward way to do this is to simulate first from your prior, then from your data generating process. There are easy ways to do this from within Stan.

What you will quickly realize is that some priors imply ridiculous things for your data. Unless you want to assign a probability that these ridiculous events are possible, think more carefully about these priors. In almost every case, a prior is going to be doing you damage. But often a prior will similarly imply weird things for your data.

Often it’s easier to express a prior on a space we can understand and map it to a constrained space

A few of the examples above are easily remedied by just scaling in the width of the priors. In other models —particularly time-series models—it’s not so easy. For example, what if we had an AR(2) model in example 2? A “stable” prior would be a (probably constrained) prior on the roots of the characteristic polynomial, which wouldn’t necessarily have an invertable relationship with the model parameters. This is difficult stuff!

One fairly commonly-used approach is to use some accept-reject method where we throw out combinations of parameters that do not imply stationarity, like in the famous Cogley Sargent (2001) paper. That method is fine from a theoretical viewpoint, but it does not scale to higher dimensions. Another, far better but more difficult approach is to define priors on a space that we understand—for example the partial autocorrelations—so that we have a mapping between those and the model parameters we care about. That’s the approach discussed by Zhang and McLeod here. This mapping needn’t be 1-1. A great example is Ben Goodrich’s prior for linear regression, which allows the researcher to express a prior over the of the regression.

The prior is information that is not in your data

The reason I got into Bayesian statistica in the first place is because I model portfolios of debt in Sub-Saharan Africa. By definition, the clients who come to Lendable to raise financing have yet to experience a “killer” portfolio shock. That is, the shocks that we have observed in their portfolios are certainly smaller than the potential shocks out there. We know this—our clients are still in business! Yet when we securitize their book, we want to generate uncertainty bounds that take into account the size of future potential shocks. The scale of these shocks, of course, will be under-estimated by the data. That’s where informative priors—research!—comes in.

Even when you’re in the world of generating good priors through deep research, you still need to think about how they work within your model! This essay by Gelman, Simpson and Betancourt summarizes the current state of affairs at the bleeding edge.

Doing applied work is hard, and choosing priors is important. But it’s an important part of your job, and frankly a small part of me dies every time I see someone write

`theta ~ normal(0, 100)`

.
It is not fine from a theoretical viewpoint to throw out combinations of parameters that do not imply stationarity in Stan. For an AR(2) model ( https://en.wikipedia.org/wiki/Autoregressive_model#AR.282.29 ), if you put

ReplyDeleteif (varphi2 < -1 || phi2 >= 1 - fabs(varphi1)) reject("explosive");

in your model block, then priors like

varphi1 ~ normal(0,1); varphi2 ~ normal(0,1);

do not integrate to 1 over the non-explosive region of the parameter space. So, the resulting posterior distribution is not right. Rejecting draws is only valid if your truncate your priors appropriately or if varphi1 and varphi2 have improper uniform priors, and even then Stan probably won't be able to sample from the posterior distribution efficiently enough for you to trust the posterior mean estimates.

Ben -- is it right to say that in Gibbs sampling it doesn't matter? That is, say your prior was p(theta) \propto I(theta) N(mu, Sigma), where I() is 0 when theta is explosive, then p(theta | y) \propto I(theta) N(mu, Sigma) p(y | theta). That seems fine to me, but I'm probably missing something.

DeleteAlso I've never totally grokked why it's OK with improper uniform priors. I've heard people say that it implies a Haar prior over the coefficients. Can you point me to anything?

The other interesting criticism of that approach (from Sims' comment to the original paper) is that for many time series we probably want a lot of mass on the unit root, so that we don't interpret historical shocks as being transitory.

Delete