Modern Statistical Workflow: Your model gives you a lot of information about what your prior should(n’t) be

Earlier this week, an interesting young researcher (names withheld to protect the innocent) came by to get some advice on one of their models. They were implementing Mr P with a model of individual $i$’s decision to vote for some candidate (indicated by $y_{i} = 1$) with a sum-of-random-effects model. The idea with these models is that you have a bunch of demographic buckets, so that your data look like this:

Individual	Voted for candidate	Age bucket	Education	…
1	1	[18, 25)	< 9th grade	…
2	0	[25,34)	Masters	…
3	0	65+	High school	…
4	0	[18, 25)	Masters	…
5	1	[45,54)	Bachelors	…

Of course we can convert these strings into indexes like so.

Individual	Voted for candidate	Age bucket	Education	…
1	1	1	1	…
2	0	2	4	…
3	0	4	3	…
4	0	1	4	…
5	1	3	2	…
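As a minimal sketch of that conversion, the snippet below maps the string labels from the five rows above to 1-based indexes. The helper name `encode` and the level orderings are assumptions chosen so the output reproduces the table; only the levels that actually appear in those rows are listed.

```python
def encode(values, levels):
    """Map each label in `values` to its 1-based position in `levels`."""
    index = {level: k for k, level in enumerate(levels, start=1)}
    return [index[v] for v in values]


# Assumed orderings, restricted to the labels visible in the five example rows.
age_levels = ["[18, 25)", "[25,34)", "[45,54)", "65+"]
edu_levels = ["< 9th grade", "Bachelors", "High school", "Masters"]

age = ["[18, 25)", "[25,34)", "65+", "[18, 25)", "[45,54)"]
edu = ["< 9th grade", "Masters", "High school", "Masters", "Bachelors"]

age_idx = encode(age, age_levels)
edu_idx = encode(edu, edu_levels)
print(age_idx, edu_idx)
```

Whatever ordering you pick, the same one has to be used for the survey data and for the poststratification table, otherwise the random effects get attached to the wrong cells.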

And then estimate a model like

$$y_{i} \sim \mathrm{Bernoulli}\left(\mathrm{logit}^{-1}\left(\alpha_{0} + \alpha_{1, \text{Age bucket}_{i}} + \alpha_{2, \text{Education}_{i}} + \dots\right)\right)$$

where each of the $\alpha_{p}$s (aside from $p = 0$) is a random effect centered around 0 with its own scale, which we estimate.

The problem with this particular researcher’s implementation was that they had been encouraged to use extremely wide priors for the $\alpha$s, like

$$\alpha_{p} \sim \mathrm{Normal}(0, \sigma_{p})$$

with

$$\sigma_{p} \sim \mathrm{Normal}_{+}(0, 100)$$

What does that mean in terms of what the researcher considers to be probable? Well, it means for instance that they think the scale of the random effects has roughly a 1/3 probability of being greater than 100. That might be OK in a linear regression, but for logit regression it makes no sense. For instance, it implies the researcher believes there’s a better than 90% probability that a given demographic cell votes for the candidate in either more than 95% or fewer than 5% of cases! Remember: with logit, 4 is basically infinity.
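These implications can be checked by brute force. The sketch below draws a scale from a half-normal prior, draws one random effect given that scale, and pushes it through the inverse logit. It looks at a single random effect on its own (no intercept, no other terms), which is a simplifying assumption; the function names and the tight comparison prior are also illustrative.

```python
import math
import random


def inv_logit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def prior_summary(scale_sd, n_draws=100_000, seed=1):
    """Return (P(sigma > scale_sd), P(implied probability outside [0.05, 0.95]))."""
    rng = random.Random(seed)
    big_scale = 0
    extreme = 0
    for _ in range(n_draws):
        sigma = abs(rng.gauss(0.0, scale_sd))  # sigma_p ~ Normal+(0, scale_sd)
        alpha = rng.gauss(0.0, sigma)          # alpha_p ~ Normal(0, sigma_p)
        p = inv_logit(alpha)
        big_scale += sigma > scale_sd
        extreme += (p > 0.95) or (p < 0.05)
    return big_scale / n_draws, extreme / n_draws


wide = prior_summary(100.0)   # the prior from the example
tight = prior_summary(1.0)    # an assumed, much tighter alternative
print("Normal+(0, 100):", wide)
print("Normal+(0, 1):  ", tight)
```

Comparing the second element of each tuple shows how much prior mass each choice places on near-unanimous cells.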

A second example

Another similar example I see a bit of is in time-series modeling, where researchers frequently write out some model, for instance a toy AR(1) like

$$y_{t} = \alpha + \beta y_{t-1} + \epsilon_{t}$$

and then give a prior to $\beta$ like

$$\beta \sim \mathrm{Normal}(0, 10)$$

If $|\beta|$ is greater than 1, then the time series $y$ is explosive. The prior above puts around 92% of its mass, up front, on values of $\beta$ with absolute value greater than 1.

Again: the researcher is saying that they give 92% probability to the series being explosive.
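The 92% figure follows directly from the normal CDF. Here is a small sketch of the calculation; the helper names and the extra prior widths 1 and 0.5 are only there for comparison and are not from the original example.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """CDF of a Normal(mean, sd) distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_explosive(prior_sd):
    """P(|beta| > 1) when beta ~ Normal(0, prior_sd)."""
    return 2.0 * (1.0 - normal_cdf(1.0, 0.0, prior_sd))


for sd in (10.0, 1.0, 0.5):
    print(f"prior sd {sd:>4}: P(|beta| > 1) = {prob_explosive(sd):.3f}")
```

Note that even a prior sd of 1 leaves a substantial share of the mass in the explosive region.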

A more difficult example

A close friend has constructed a structural model that takes a set of behavioural parameters for each individual, $\theta_{i}$, and maps these to the “utility” coming from each contract, which we index with $c \in 1:C$. If $X_{c}$ is a vector of characteristics of the contract (for example, the amounts that need to be paid under various scenarios), then each contract is worth

$$U_{ic} = u(X_{c}, \theta_{i}) + \epsilon_{ic}$$

utils to each individual, where $\epsilon_{ic}$ is iid Gumbel distributed. The aim is to estimate the joint distribution of $\theta_{i}$ across individuals.

The nice thing about using Gumbel errors in a framework like this is that it makes the likelihood easy to construct. If $U_{i}$ is the vector of utilities from each choice, then

$$\mathrm{Prob}(U_{ic} = \max(U_{i})) = \frac{\exp(u(X_{c}, \theta_{i}))}{\sum_{c'} \exp(u(X_{c'}, \theta_{i}))}$$

The problem she encountered is that $X$ contains large numbers, with payments in the thousands of dollars. Even with priors that might “look” fairly tight, like $\theta_{i} \sim \mathrm{Normal}(0, 1)$, $\exp(u(X_{c}, \theta_{i}))$ can be an enormous number, and imply choice probabilities very close to 1 or 0. If we observe someone making a choice assigned a probability close to 0, the gradient of the log likelihood becomes very difficult to evaluate, and your model will have a tough time converging.
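To see the saturation concretely, here is a sketch with a made-up linear utility $u(X_{c}, \theta_{i}) = \theta_{i} X_{c}$ and three hypothetical contracts described by a single payment amount. Neither the utility function nor the numbers come from the actual model; they are only there to show the scale problem.

```python
import math


def choice_probs(utilities):
    """Softmax over deterministic utilities, using the log-sum-exp shift."""
    m = max(utilities)
    weights = [math.exp(u - m) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical contracts: one payment amount each, in dollars.
payments = [1000.0, 1500.0, 3000.0]
theta = 0.5  # an unremarkable draw from Normal(0, 1)

# Assumed utility u(X_c, theta) = theta * X_c. Without the shift inside
# choice_probs, math.exp(1500.0) would raise OverflowError.
raw = choice_probs([theta * x for x in payments])

# The same draw with payments expressed in thousands of dollars.
scaled = choice_probs([theta * x / 1000.0 for x in payments])

print("dollars:          ", raw)
print("thousands of USD: ", scaled)
```

With payments in dollars the probabilities collapse onto one contract, while rescaling the inputs (equivalently, tightening the prior on $\theta_{i}$ by the same factor) keeps them in a range where the likelihood still carries gradient information.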

So, what do we do?

There are a couple of approaches to helping to resolve these problems in applied modeling:

Learn about how your prior affects your model by simulation

Perhaps the most straightforward way to do this is to simulate first from your prior, then from your data generating process given those draws. There are easy ways to do this from within Stan.

What you will quickly realize is that some priors imply ridiculous things for your data. Unless you really want to assign meaningful probability to these ridiculous events, think more carefully about these priors. In almost every case, a $\mathrm{Normal}(0, 100)$ prior is going to be doing you damage. But often a $\mathrm{Normal}(0, 1)$ prior will similarly imply weird things for your data.
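A minimal sketch of that two-step simulation, written in plain Python rather than Stan and using the toy AR(1) from the second example. The intercept of 0, the unit noise scale, the series length of 30, and the threshold of 100 for calling a series "wild" are all assumptions made for illustration.

```python
import random


def simulate_ar1(alpha, beta, n_steps, rng, noise_sd=1.0):
    """Simulate y_t = alpha + beta * y_{t-1} + eps_t starting from y_0 = 0."""
    y = 0.0
    path = []
    for _ in range(n_steps):
        y = alpha + beta * y + rng.gauss(0.0, noise_sd)
        path.append(y)
    return path


def prior_predictive(beta_sd, n_sims=2000, n_steps=30, seed=42):
    """Share of prior draws whose simulated series ever exceeds |y| > 100."""
    rng = random.Random(seed)
    wild = 0
    for _ in range(n_sims):
        beta = rng.gauss(0.0, beta_sd)                # step 1: draw from the prior
        path = simulate_ar1(0.0, beta, n_steps, rng)  # step 2: draw data given it
        wild += max(abs(v) for v in path) > 100.0
    return wild / n_sims


for sd in (10.0, 1.0, 0.3):
    print(f"beta ~ Normal(0, {sd}): share of wild series = {prior_predictive(sd):.3f}")
```

If the simulated data look nothing like any series you would expect to see, that is the prior telling you it needs more thought.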

Often it’s easier to express a prior on a space we can understand and map it to a constrained space

A few of the examples above are easily remedied by just narrowing the width of the priors. In other models, particularly time-series models, it’s not so easy. For example, what if we had an AR(2) model in example 2? A “stable” prior would be a (probably constrained) prior on the roots of the characteristic polynomial, which wouldn’t necessarily have an invertible relationship with the model parameters. This is difficult stuff!

One fairly commonly used approach is an accept-reject method in which we throw out combinations of parameters that do not imply stationarity, as in the famous Cogley and Sargent (2001) paper. That method is fine from a theoretical viewpoint, but it does not scale to higher dimensions. Another, far better but more difficult approach is to define priors on a space that we understand (for example the partial autocorrelations) so that we have a mapping between those and the model parameters we care about. That’s the approach discussed by Zhang and McLeod here. This mapping needn’t be one-to-one. A great example is Ben Goodrich’s prior for linear regression, which allows the researcher to express a prior over the $R^{2}$ of the regression.
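For the AR(2) case, the partial-autocorrelation idea can be sketched with the standard Durbin-Levinson recursion: any pair of partial autocorrelations in $(-1, 1)$ maps to a stationary pair of coefficients. The code below is an illustration of that textbook mapping, not a reproduction of any particular paper's parameterisation, and the independent uniform prior on the partial autocorrelations is an assumption.

```python
import random


def pacf_to_ar2(r1, r2):
    """Map partial autocorrelations (r1, r2) in (-1, 1) to AR(2) coefficients
    (phi1, phi2) using the Durbin-Levinson recursion."""
    phi1 = r1 * (1.0 - r2)
    phi2 = r2
    return phi1, phi2


def is_stationary_ar2(phi1, phi2):
    """Stationarity triangle for y_t = phi1*y_{t-1} + phi2*y_{t-2} + eps_t."""
    return phi2 > -1.0 and phi1 + phi2 < 1.0 and phi2 - phi1 < 1.0


rng = random.Random(0)
draws = []
for _ in range(10_000):
    # Assumed prior: independent Uniform(-1, 1) on each partial autocorrelation.
    r1 = rng.uniform(-1.0, 1.0)
    r2 = rng.uniform(-1.0, 1.0)
    draws.append(pacf_to_ar2(r1, r2))

share_stationary = sum(is_stationary_ar2(p1, p2) for p1, p2 in draws) / len(draws)
print(f"share of draws in the stationary region: {share_stationary:.4f}")
```

Because every draw lands inside the stationary region by construction, nothing has to be rejected, and the prior can be stated on quantities that are easy to reason about.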

The prior is information that is not in your data

The reason I got into Bayesian statistics in the first place is that I model portfolios of debt in Sub-Saharan Africa. By definition, the clients who come to Lendable to raise financing have yet to experience a “killer” portfolio shock. That is, the shocks that we have observed in their portfolios are certainly smaller than the potential shocks out there. We know this because our clients are still in business! Yet when we securitize their book, we want to generate uncertainty bounds that take into account the size of future potential shocks. The scale of these shocks, of course, will be under-estimated by the data. That’s where informative priors (research!) come in.

Even when you’re in the world of generating good priors through deep research, you still need to think about how they work within your model! This essay by Gelman, Simpson and Betancourt summarizes the current state of affairs at the bleeding edge.

Doing applied work is hard, and so is choosing priors. But it’s an important part of your job, and frankly a small part of me dies every time I see someone write theta ~ normal(0, 100).

3 comments:

Unknown, October 15, 2017 at 6:32 PM
It is not fine from a theoretical viewpoint to throw out combinations of parameters that do not imply stationarity in Stan. For an AR(2) model ( https://en.wikipedia.org/wiki/Autoregressive_model#AR.282.29 ), if you put

if (varphi2 < -1 || varphi2 >= 1 - fabs(varphi1)) reject("explosive");

in your model block, then priors like

varphi1 ~ normal(0,1); varphi2 ~ normal(0,1);

do not integrate to 1 over the non-explosive region of the parameter space. So, the resulting posterior distribution is not right. Rejecting draws is only valid if you truncate your priors appropriately or if varphi1 and varphi2 have improper uniform priors, and even then Stan probably won't be able to sample from the posterior distribution efficiently enough for you to trust the posterior mean estimates.


Sunday, October 15, 2017


Jim Savage
