Thursday, November 3, 2016

Random effects, partial pooling and exchangeability

I’ve had some feedback on two points of my zen that I’d like to clear up. First, the point of big data is to estimate rich models. Second, take advantage of partial pooling. The feedback has basically been “how?”
So here’s how.

Hierarchical data helps you do better causal analysis AND prediction …if you do it right

One of the nice things about working in industry is that we get to play with extremely rich/big data. We might observe customers, products and staff over time and geographies, allowing a huge amount of potential models. This data might look like this:
individualtimevalue
11-0.45255059
120.81522520
130.93875991
21-0.39965080
220.02931142
231.49530643
310.58661273
320.61238152
33-1.47199902
or like this:
regionrural/urbanvalue
SouthLittleton0.7203744
SouthLittleton-0.3914667
SouthTinyton-0.4180982
SouthTinyton-1.5864343
WestHumangaville-0.3516165
WestSmallville-0.1462636
WestHumangaville1.0616746
WestSmallville-0.3424441
They’re toy examples. But the general gist is that you have data that exists at multiple hierarchies. Sometimes these hierarchies are nested, as in the second example, while other times the groupings are latticed, as in the first. Or sometimes both.
There is an enormous amount of information in this hierarchical data. Yet sadly, analysts too commonly destroy this information by aggregating across hierarchies, or simply don’t take advantage of the nature of the data. Yes, these expensive “data scientists” are leaving $20 bills on the sidewalk.
What am I talking about? See if these sound familiar:
  • A credit risk analyst wants to judge the probability of a customer defaulting on a loan. They have a huge amount of individual-level repayment history, as well as personal information. This is panel data at its best. What happens (almost all the time) is that the analyst reduces each customer (for whom they have hundreds of rows of data) to a single row, fashioning a bunch of predictors from each customer’s repayment history and personal details. They then try to predict some binary outcome (normally if the customer misses payments for 90 days) from the predictors. No dynamics, no covariance between customers, nuthin’.
  • A “big data engineer” (scare quotes useful) is working on a recommender engine. They know a distributed computing framework like Hadoop or Spark, and have a lot of data. So they split the (enormous) data across many computers, run a model on each of the computers, and recombine the results. Good big data engineers will be cognizant of the hierarchy the data, processing similar observations together.
  • An economist is conducting a study on panel of survey data and “takes time and individual fixed effects”.
I’m not bagging these methods. They’ve been extremely useful for a long time. And good analysis with simple techniques can be far more useful than bad analysis with fancy techniques. But technology has progressed, and we can now do good analysis better.

A basic model

Through this post I’ll refer to a “hierarchical model”. To give some flavour to this, let’s just say I’m talking about a mixed normal linear model, which I’ll describe right here. This is not because I love the normal mixed linear model—it’s just an example, but one that is flexible enough to be useful for many problems. You could adapt everything that follows to help you with many different models, from big neural networks to gaussian processes to financial time series to structural IO models. Bear with me.
In the normal mixed linear model, we say that some individual  in group  has an outcome  and a vector of covariates . If it helps you, think of  as the dollars spent on coffee by customer  (who lives in town );  contains information about the customer—their gender, age, income and so on. If you’re more used to repeated observations, then  could equally be a time grouping. The model is just


where  is group ’s intercept,  is a vector of slope terms, and  is the regression residual, which in this simple model we’ll assume  is normally distrubuted with a mean of 0 and standard deviation of .
As you can see, in this general model we let  and  vary by group. This might be new to you, and it’s a part of the secret sauce discussed below. Allowing parameters to vary by groups and subgroups is what allows us to build amazingly rich models and make the most out of our data. But it’s not trivial, and we’ll get to that soon; first, how does having varying parameters at a group level help us?

How does having varying parameters at the group level help us?

Two words: unobserved information. In particular, unobserved information that is is fairly fixed within groups. This can take two forms.
The first is information that affects the value of covariates  and also affects . Such information is confounding in that it generates a non-causal correlation between covariates and outcomes (biasing our estimates of  away from their causal value). If this information is fairly fixed within a group, then group-varying intercepts can help to control for it. This is the same argument for fixed effects regression with panel data.
Taking our coffee example, perhaps some towns are more stressful to live in than others because they have high-paced work. The stress and pace of work is never observed, but it causes both higher incomes and more coffee sales. So if you observe that towns with higher incomes tend to drink more coffee, what would you make of an economic boom? Would you expect it to cause higher coffee sales? Not unless it affects work-pace and stress. Soaking up the unobserved information fixed within groups over time goes some way to fixing this problem (more on this below). To do this, we have group-varying intercepts.
The second is unobserved information that affects the relationship between a covariate and the outcome. More than anything, this is a relaxation of the assumption that slope parameters be the same for all groups. Perhaps the rich people in one town happen to be Mormon bankers who don’t drink coffee? Allowing slopes to vary across groups can help capture this additional information.

So how do we estimate these models?

Imagine a spectrum. On one end, we assume that all groups have the same set of parameters—no group-varying parameters. This is known as a “pooled” regression. In this case,  and  which just gives the normal linear model


At the other end of the spectrum, we estimate  for each of the groups by running a linear regression on each of the groups separately—so-called “un-pooled” regression. No information from any group other than  influences the parameter estimate for group . Conceptually, that’s similar to what people are doing with Hadoop and Spark today.
The approach we Bayesian advocate is a so-called “partial pooling” approach, which will result in parameter estimates somewhere in the middle of the spectrum. The intuition for the approach comes from a few observations:
  • Groups tend to mean-regress, and so the best parameters for use out-of-sample probably aren’t those from the un-pooled estimates no-matter the sample size—they’re probably closer to the pooled estimates. When I say “fixed effects normally aren’t”, this is what I mean.
  • Similar groups (similar in the sense that they’re nearby in the hierarchy) tend to behave similarly; it makes sense to let the parameter estimates from one group “borrow power” from similar groups.
  • We should be far more wary of drawing inference from small groups than large groups. And,
  • The relationships between parameter estimates across groups can give us important prior information for the values of parameters in small groups or those with noisy data.
The take-away from those points is that we probably want to be somewhere between the pooled and unpooled estimates. In practice this means being a good Bayesian and specifying a probability model for the data and parameters.
So we still have our model


And we have our parameters . Let’s let . Our prior distribution for  is


Where  is a vector of expected values for , and  is their covariance matrix. We can decompose the covariance into scale  and correlation . This parameterization is easier to give informative priors than for See here.
I used to gloss over the whole covariance of parameters thing. But it’s very important! Imagine we have many groups, each with only two parameters,  and . We run an un-pooled regression for every group and find that in one group we have a very precise estimate for its  but not its  (perhaps the group was small). Now imagine we learn that across groups the correlation between  and  is -0.8. This new information should help us get a better estimate for  in the troublesome group. Being explicit about the covariance between group-level parameters can help you benefit from this.
To implement a mixed (generalized) linear model like the one above, you might want to look at the rstanarm package in R. If you want to implement more adventurous models, then you should do it in Stan. Mike’s talk here will convince you that you can’t do this using Maximum Likelihood.

A warning: Exchangeability matters (common argument against random effects models)

Astute readers familiar with fixed effects models will have noted a problem with one of my arguments above. I said that we could use random intercepts to soak up unobserved information that affects both  and  by including group-varying intercepts . But this implies that the unobserved information fixed in a group, , is correlated with . This correlation violates a very important rule in varying-intercept, varying-slope models: exchangeability.
Exchangeability says that there should be no information other than the outcome  that should allow us to distinguish the group to which a group-level parameter belongs.
In this example, we can clearly use values of X to predict , violating exchangeability. But all is not lost. The group-varying parameter needs not be uncorrelated with X, only the random portion of it.

The Bafumi-Gelman correction

Imagine we have an exchangeability issue for a very simple model with only group-varying intercept: the unobserved information  is correlated with  across groups. Let’s break  down into its fixed and random portions.


where 


So that now the regression model can be written as


For the correlation to hold, it must be the case that  is correlated with . But our regression error is , which is clearly correlated with violating the Gauss-Markov theorem and so giving us biased estimates.
In a nice little paper Bafumi and Gelman suggest an elegant fix to this: simply control for group level averages in the model of . This is a Bayesian take on what econometricians might know as a Mundlak/Chamberlain approach. If  is the mean of  in group , then we could use the model


which results in the correlaton between  and  across groups being 0. It’s straightforward to implement, and gets you to conditional exchangeability—a condition under which mixed models like this one are valid.
Happy modeling!

5 comments:

  1. Would be great to see you work through an example - take a data set like the customer time series for example, and actually build/interpret a model in rstanarm or Stan.

    ReplyDelete
    Replies
    1. Ask and you shall receive! I have four 70%-finished posts in the backlog, but will get to your request after.

      Jim

      Delete
  2. Thanks James this is useful. For a study that I am doing I am interacting group specific dummies with the covariate, but some groups are small or noisy and this may be an alternative to get more reliable estimates.

    ReplyDelete
    Replies
    1. Almost certainly. Let me know how the out-of-sample checking goes!

      Delete
  3. Apologies for the typos. For some reason Blogger changes the formatting if I try to edit a post, so I'll just leave them there.

    ReplyDelete