Saturday, September 10, 2016

Trump for President? Aggregating national polling data

A few weeks ago, just after the Democratic convention, Hillary’s poll numbers shot up. A week before that, Trump’s shot up by a similar degree, before falling. At time of writing, Hillary’s numbers have eased by a few percentage points. It all seems a bit jumpy. Do people really change their minds so quickly, only to reverse them a few days later? Or are we just over-interpreting noise? And which polls should we believe?
These are fun statistical questions, and so one morning on the subway to work I (Jim) put together a very simple little model to aggregate polls. The idea was extremely simple: by themselves, polls are noisy measures of some unobserved truth (the true distribution of political preferences). They are noisy partly because of sampling variation, partly because of biases in questioning or methodology, and perhaps even because respondents get swept up in the moment and give answers that mightn’t reveal their true preferences.
It makes sense to smooth out the noise. But how? I proposed a very simple state-space model of political preferences where on day $t$, the preference share $\mu_{c,t}$ for candidate $c$ would evolve according to a random walk with normal innovations. This gives us a very simple model of the unobserved state.
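In symbols, one way to write this state equation is sketched here. The notation is an assumption for illustration, not copied from the scripts: $\mu_{c,t}$ is candidate $c$'s preference share on day $t$, and $\sigma$ is the standard deviation of the daily innovations.

$$\mu_{c,t} \sim \mathcal{N}(\mu_{c,t-1},\ \sigma)$$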

What we do observe are noisy measurements of the state: the polls, conducted by polling firm $p$ and expressed as proportions.
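A matching sketch of the measurement equation, again in assumed notation: $y_{p,c,t}$ is the share for candidate $c$ reported by polling firm $p$ on day $t$, $\mu_{c,t}$ is the unobserved preference share on that day, and $\tau$ is the standard deviation of the polling noise. A normal likelihood is the simplest choice and is assumed here.

$$y_{p,c,t} \sim \mathcal{N}(\mu_{c,t},\ \tau)$$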

A more sophisticated model would introduce a bias term in the mean of the measurement model to control for any systematic bias in the polls.
In putting together the model, my point was simply that the modeller can impose a small value of the innovation standard deviation $\sigma$ (or very tight priors) and significantly smooth out the temporal variation in the polls, while capturing most of the long-run variation. And Stan makes it easy enough to do on the subway. You can see the scripts here and a discussion on Gelman’s blog here (the comments on the blog are superb).
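To see the smoothing effect without running Stan, here is a minimal Python sketch. It swaps in a plain Kalman filter for the Bayesian model, treats the innovation and polling standard deviations (`sigma` and `tau`, names assumed here) as fixed numbers rather than estimating them, and uses simulated polls, not real ones.

```python
import random

def local_level_filter(polls, sigma, tau, mean0=0.5, var0=1.0):
    """Kalman filter for a random-walk state observed with Gaussian noise.

    polls: list of daily poll shares (None for a day with no poll)
    sigma: assumed sd of the daily random-walk innovations
    tau:   assumed sd of the polling noise
    Returns the filtered mean of the state for each day.
    """
    mean, var = mean0, var0
    filtered = []
    for y in polls:
        var += sigma ** 2                  # predict: the state takes one random-walk step
        if y is not None:                  # update only on days that have a poll
            gain = var / (var + tau ** 2)
            mean += gain * (y - mean)
            var *= 1.0 - gain
        filtered.append(mean)
    return filtered

def mean_abs_step(path):
    """Average absolute day-to-day change of a path."""
    return sum(abs(b - a) for a, b in zip(path, path[1:])) / (len(path) - 1)

# Simulated data: a slowly drifting "true" share and one noisy poll per day.
random.seed(0)
truth, polls = 0.45, []
for _ in range(100):
    truth += random.gauss(0.0, 0.002)
    polls.append(truth + random.gauss(0.0, 0.03))

smooth = local_level_filter(polls, sigma=0.002, tau=0.03)  # small sigma: calm estimate
jumpy = local_level_filter(polls, sigma=0.05, tau=0.03)    # large sigma: chases each poll
print(mean_abs_step(smooth), mean_abs_step(jumpy))
```

The only thing that differs between the two runs is `sigma`, so any difference in how much the estimate moves from day to day comes from that one modelling choice.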

Is this a good model?

Of course, the model I put together is insane. It has a random walk state, which is a fancy way of saying that we believe that the vote preference shares are unbounded. There is no polling firm bias. No controlling for bias in the sample. And there is no attempt at all to capture any predictability in the vote shares or their dynamics. We should do better.
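The unboundedness is easy to check by simulation. The sketch below is an illustration only: the starting share of 0.5, the daily innovation sd of 0.01 and the horizons are all made-up values. It counts how often a random walk with normal innovations wanders outside the interval from 0 to 1.

```python
import random

def fraction_escaping(n_days, sigma, start=0.5, n_paths=500, seed=1):
    """Share of simulated random-walk paths that leave [0, 1] within n_days."""
    rng = random.Random(seed)
    escaped = 0
    for _ in range(n_paths):
        share = start
        for _ in range(n_days):
            share += rng.gauss(0.0, sigma)   # one day of normal innovation
            if share < 0.0 or share > 1.0:   # an impossible vote share
                escaped += 1
                break
    return escaped / n_paths

for n_days in (30, 365, 1825):
    print(n_days, fraction_escaping(n_days, sigma=0.01))
```

Over a short campaign window the problem rarely bites, which is why the simple model still gives sensible-looking output, but nothing in the model itself rules out impossible shares.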

Enter, Trangucci

A few of the comments on the blog made the point that since the number of days is pretty small, we might be able to give Gaussian Process priors to the time-varying component of preference shares, rather than the random walk prior. Rob Trangucci, a mate of mine on the Stan team, had some spare time and thought that this would be a good approach. Gaussian processes have the attractive property that they provide more certainty around the more commonly observed points, and more uncertainty away from them. They are also fantastic at capturing dynamics. They make these improvements at the cost of speed.
A Gaussian process on a time-series variable $f$ can be thought of as the entire time series being a single draw from a multivariate normal distribution (a Gaussian). The enormous covariance matrix is of course unidentified, so we impose constraints on its elements with a covariance function $k$ using a much smaller number of parameters $\theta$. We construct the time-series variable $f_c$ to be related to candidate $c$’s preference share so that when $f_c$ is higher, all else equal, so too will be candidate $c$’s preference share.
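As a concrete picture of "one draw from a multivariate normal", here is a small Python sketch. The squared-exponential covariance function and its two parameters (`alpha` for the marginal sd, `rho` for the length scale) are assumptions chosen for illustration; the post does not say which covariance function is used in the actual model.

```python
import math
import random

def sq_exp_cov(days, alpha, rho, jitter=1e-6):
    """Squared-exponential covariance matrix over the given days.

    A small jitter is added on the diagonal for numerical stability.
    """
    n = len(days)
    return [[alpha ** 2 * math.exp(-0.5 * ((days[i] - days[j]) / rho) ** 2)
             + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular factor low such that low * low^T equals a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def draw_gp(days, alpha, rho, rng):
    """One zero-mean draw of the whole time series from the implied Gaussian."""
    low = cholesky(sq_exp_cov(days, alpha, rho))
    z = [rng.gauss(0.0, 1.0) for _ in days]   # independent standard normals
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(days))]

days = list(range(30))
path = draw_gp(days, alpha=1.0, rho=5.0, rng=random.Random(2))
print([round(v, 2) for v in path[:5]])
```

With 30 days, the full covariance matrix has 30 × 31 / 2 = 465 distinct entries, and here two parameters determine all of them.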

By itself, the Gaussian Process prior doesn’t impose boundedness on the preference share; we want a mapping between the latent variable $f_c$ for candidate $c$ and the preference share. To achieve this, Rob suggested modeling the survey responses themselves. Where