## Sunday, October 15, 2017

Earlier this week, an interesting young researcher (name withheld to protect the innocent) came by to get some advice on one of their models. They were implementing Mr P (multilevel regression and poststratification) with a model of individual $i$'s decision to vote for some candidate (indicated by $y_i = 1$), using a sum-of-random-effects model. The idea with these models is that you have a bunch of demographic buckets, so that your data look like this:
| Individual | Voted for candidate | Age bucket | Education |
|---|---|---|---|
| 2 | 0 | [25, 34) | Masters |
| 3 | 0 | 65+ | High school |
| 4 | 0 | [18, 25) | Masters |
| 5 | 1 | [45, 54) | Bachelors |
Of course we can convert these strings into indexes like so.
| Individual | Voted for candidate | Age bucket | Education |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | 0 | 2 | 4 |
| 3 | 0 | 4 | 3 |
| 4 | 0 | 1 | 4 |
| 5 | 1 | 3 | 2 |
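In code, this kind of conversion is a one-liner per column with pandas (a sketch using the toy table above; the particular integer assigned to each bucket is arbitrary, so these indices needn't match the table exactly):

```python
import pandas as pd

df = pd.DataFrame({
    "age_bucket": ["[25, 34)", "65+", "[18, 25)", "[45, 54)"],
    "education": ["Masters", "High school", "Masters", "Bachelors"],
})

# Map each distinct string to a 1-based integer index.
for col in ["age_bucket", "education"]:
    df[col + "_ix"] = df[col].astype("category").cat.codes + 1

print(df)
```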
And then estimate a model like

$$y_i \sim \text{Bernoulli}\!\left(\text{logit}^{-1}\!\left(\theta_{0} + \theta_{1,\,\text{age}(i)} + \theta_{2,\,\text{education}(i)}\right)\right)$$

where each of the $\theta_p$s (aside from $p = 0$) is a random effect centered at 0 with its own scale, which we estimate:

$$\theta_{p,k} \sim \text{Normal}(0, \sigma_p) \quad \text{for } p > 0.$$
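To make the structure concrete, here's a minimal simulation sketch in Python/NumPy. Everything in it (the bucket counts, the scales `sigma_age` and `sigma_edu`, the intercept) is my own illustration, not the researcher's actual model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Demographic bucket index for each individual (4 age and 4 education buckets).
age = rng.integers(0, 4, size=n)
edu = rng.integers(0, 4, size=n)

# Each family of random effects is centered at 0 with its own scale.
sigma_age, sigma_edu = 0.5, 0.8            # illustrative scales
theta_0 = -0.2                             # intercept
theta_age = rng.normal(0, sigma_age, 4)    # theta_{1,k}
theta_edu = rng.normal(0, sigma_edu, 4)    # theta_{2,k}

# Sum-of-random-effects logit: add up the relevant effects, squash, simulate.
eta = theta_0 + theta_age[age] + theta_edu[edu]
prob = 1 / (1 + np.exp(-eta))              # inverse logit
voted = rng.binomial(1, prob)

print("simulated vote share:", voted.mean())
```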

The problem with this particular researcher's implementation was that they had been encouraged to use extremely wide priors for the $\sigma_p$s, like $\sigma_p \sim \text{Normal}^{+}(0, \tau)$ with $\tau = 100$.
What does that mean in terms of what the researcher considers probable? Well, it means, for instance, that they think the scale of the random effects has roughly a 1/3 probability of being greater than 100. That might be OK in a linear regression, but for a logit model it makes no sense. If the scale really were 100, it would imply about a 98% probability that a given demographic cell votes for the candidate in more than 95% of cases, or in fewer than 5%! Remember: with logit, 4 is basically infinity.
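Both numbers are easy to check. Here's a quick sketch with SciPy, assuming the half-Normal(0, 100) prior above:

```python
import numpy as np
from scipy.stats import norm

# sigma ~ half-Normal(0, 100): P(sigma > 100) = P(|Z| > 1) for Z ~ N(0, 1).
print("P(sigma > 100):", 2 * norm.sf(1.0))                # ~0.32, about 1/3

# If the scale really were 100, then theta ~ Normal(0, 100) for a given cell.
# The cell votes >95% for the candidate when theta > logit(0.95),
# and <5% when theta < -logit(0.95).
logit_95 = np.log(0.95 / 0.05)                            # ~2.94
print("P(|theta| > 2.94):", 2 * norm.sf(logit_95 / 100))  # ~0.98
```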
### A second example
Another example I see a fair bit of is in time-series modeling, where researchers write out some model, for instance a toy AR(1) like

$$y_t = \alpha + \rho\, y_{t-1} + \epsilon_t,$$

and then give a prior to $\rho$ like

$$\rho \sim \text{Normal}(0, 10).$$

If $|\rho| > 1$, then the time series $y_t$ is explosive. The prior given above puts around 92% of its density on values of $\rho$ with an absolute value greater than 1.

Again: the researcher is saying that they give 92% probability to the series being explosive.
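A quick sketch to verify, again assuming the Normal(0, 10) prior; the explosive draw `rho = 1.1` is just an illustration:

```python
import numpy as np
from scipy.stats import norm

# Prior mass outside the stationary region under rho ~ Normal(0, 10).
print("P(|rho| > 1):", 2 * norm.sf(1 / 10))  # ~0.92

# A single prior draw with |rho| > 1 produces an explosive path.
rng = np.random.default_rng(3)
rho, alpha = 1.1, 0.0                        # illustrative explosive draw
y = np.zeros(100)
for t in range(1, 100):
    y[t] = alpha + rho * y[t - 1] + rng.normal()
print("y[99]:", y[99])                       # the series has blown up
```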
### A more difficult example
A close friend has constructed a structural model that takes a set of behavioural parameters for each individual, $\theta_i$, and maps these to the “utility” coming from each contract, which we index with $j$. If $X_j$ is a vector of characteristics of contract $j$ (for example, the amounts that need to be paid under various scenarios), then contract $j$ is worth $u(X_j, \theta_i) + \epsilon_{ij}$ utils to individual $i$, where the $\epsilon_{ij}$ are iid Gumbel distributed. The aim is to estimate the joint distribution of $\theta_i$ across individuals.
The nice thing about using Gumbel errors in a framework like this is that it makes the likelihood easy to construct. If $u_i = \big(u(X_1, \theta_i), \dots, u(X_J, \theta_i)\big)$ is the vector of utilities from each choice, then

$$p(\text{choice}_i = j) = \text{softmax}(u_i)_j = \frac{\exp\!\big(u(X_j, \theta_i)\big)}{\sum_{k=1}^{J} \exp\!\big(u(X_k, \theta_i)\big)}.$$
The problem she encountered is that $X_j$ contains large numbers, with payments in the thousands of dollars. Even with priors that might “look” fairly tight, like $\theta_i \sim \text{Normal}(0, 1)$, a linear utility such as $X_j'\theta_i$ can be an enormous number, and so imply choice probabilities very close to 1 or 0. If we observe someone making a choice the model assigns a probability close to 0, the gradient of the log likelihood becomes very difficult to evaluate, and your model will have a tough time converging.
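To see the problem numerically, here's a small sketch (the payment amounts and single-characteristic contracts are made up for illustration). The naive softmax saturates completely, the log probabilities of the “wrong” choices are in the hundreds, and simply rescaling the covariates restores sensible probabilities:

```python
import numpy as np
from scipy.special import softmax, log_softmax

# Three contracts, one characteristic each: payments in dollars.
X = np.array([[1200.0], [1500.0], [3000.0]])
theta = np.array([0.5])   # a draw from a "tight-looking" Normal(0, 1) prior

u = X @ theta             # utilities: [600, 750, 1500]
print(softmax(u))         # [0, 0, 1] -- completely saturated
print(log_softmax(u))     # [-900, -750, ~0]: observing a choice of the first
                          # contract would contribute -900 to the log likelihood

# Rescaling (dollars -> thousands of dollars) keeps the implied choice
# probabilities away from 0 and 1.
print(softmax((X / 1000.0) @ theta))   # roughly [0.22, 0.25, 0.53]
```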

### So, what do we do?

There are a couple of approaches that help resolve these problems in applied modeling:

- Simulate from your priors and push the draws through the model to the outcome scale (a prior predictive check), as in the sketches above. If the implied vote shares, time series, or choice probabilities look absurd, the priors are wrong, no matter how innocuous they look on the parameter scale.
- Put your data on a scale where “tight-looking” priors really are tight: center and standardize covariates, or measure payments in thousands of dollars rather than dollars.
Doing applied work is hard, and choosing priors is a big part of that difficulty. But it's an important part of your job, and frankly a small part of me dies every time I see someone write `theta ~ normal(0, 100)`.