A very fun class of problems has the following structure: we observe some data, but there are several mechanisms that may have generated it. We see someone’s height, and need to infer their gender. Or we see some monthly unemployment figures and have to work out whether we’re in a low-growth scenario or not. One modeling framework that helps us think about these problems is the finite mixture model. Explicitly, the model looks like this:
We first draw an index that tells us which model the data will be drawn from. Each state has probability θk, collected in Θ with ∑kθk=1:
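Writing $z_i$ for the index drawn for observation $i$, and $K$ for the number of states:

$$z_i \sim \text{Categorical}(\theta_1, \ldots, \theta_K)$$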
Once we know which model observation i is going to come from, we can draw the observations. If the data generating process is a normal with location μk and scale σk, then
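Conditional on the draw $z_i = k$:

$$y_i \mid z_i = k \sim \text{Normal}(\mu_k, \sigma_k)$$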
Of course the generative model for yi needn’t be such a simple model, or even a model of the same structure. All it needs to do is have distributional implications for the outcome data y.
So that’s the generative model. As you know, I’m a big fan of Modern Statistical Workflow. So let’s simulate some data in R so that there’s no confusion.
```r
library(dplyr); library(ggplot2); library(ggthemes)

# Number of data points
N <- 400

# Let's make three states
mu <- c(3, 6, 9)
sigma <- c(2, 4, 3)
# with probability
Theta <- c(.5, .2, .3)

# Draw which model each observation belongs to
z <- sample(1:3, size = N, prob = Theta, replace = TRUE)

# Some white noise
epsilon <- rnorm(N)

# Simulate the data using the fact that y ~ normal(mu, sigma) can be
# expressed as y = mu + sigma*epsilon for epsilon ~ normal(0, 1)
y <- mu[z] + sigma[z] * epsilon

tibble(y, z = as.factor(z)) %>%
  ggplot(aes(x = y, fill = z)) +
  geom_density(alpha = 0.3) +
  theme_economist() +
  ggtitle("Three data generating processes")
```
What should you take from the above? The big point is that if you observe a value of 7, it might very plausibly have come from any of the models; likewise a value of 15 probably did not come from the first model, and a value of 0 probably came from the first. This brings us to the likelihood.
From generative model to likelihood
Recall that the likelihood contribution of a data point yi is the height of the generative density for a given set of parameters. But here we have several sets of parameters. How do we deal with that?
Well, if the probability of being drawn from model 1 is θ1 and so on, we can write out the likelihood as
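With $K$ states, the likelihood contribution of observation $i$ is the probability-weighted sum of the component densities:

$$p(y_i \mid \Theta, \mu, \sigma) = \sum_{k=1}^{K} \theta_k \, \text{normal}(y_i \mid \mu_k, \sigma_k)$$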
where the notation normal(yi|μk,σk) refers to the height of the normal density at yi with the given parameters. The above is the likelihood contribution of a single data point. The likelihood of the whole model (under the assumption that observations are independent) is the product of the individual likelihood contributions. Of course, this product is typically an astronomically small number, which causes numerical problems. So we prefer to work on the log scale, which lets us sum log likelihood contributions rather than multiply likelihood contributions.
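On the log scale, the full log likelihood is then:

$$\log p(y \mid \Theta, \mu, \sigma) = \sum_{i=1}^{N} \log\!\left( \sum_{k=1}^{K} \theta_k \, \text{normal}(y_i \mid \mu_k, \sigma_k) \right)$$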
Now how should we evaluate the inside of the log() on the right?
Log sum exp to the rescue
A very handy transformation here is the log_sum_exp() function, which, as its name suggests, is the log of summed exponents. It is defined as:
log_sum_exp(a, b) = log(exp(a) + exp(b))
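Why bother with this identity? Because computed naively, the exponentials underflow (or overflow) for the extreme log values we routinely work with. Stable implementations first factor out the largest term, so at least one exponent is exactly zero. Here is a minimal Python sketch of that trick (the function name simply mirrors the one above):

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x) for x in xs)): factor out the max term."""
    m = max(xs)
    if m == -math.inf:  # every term is zero probability
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

# The naive form log(exp(1000) + exp(1000)) overflows;
# the stable form returns 1000 + log(2) without trouble.
print(log_sum_exp([1000.0, 1000.0]))  # ~ 1000.693
```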
Now look at our log likelihood above. We could rewrite it as
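Writing each term of the inner sum as the exponential of its own log, the per-observation contribution becomes:

$$\log p(y_i \mid \Theta, \mu, \sigma) = \text{log\_sum\_exp}_{k=1}^{K}\Big( \log \theta_k + \log \text{normal}(y_i \mid \mu_k, \sigma_k) \Big)$$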
Coded up in Stan and fit to our simulated data, that model appeared to estimate very cleanly. But if we print the parameter estimates, we see that all the Rhats are large and the n_effs are low. That's bad! So we should check out the traceplots to see what's going on.
Oh dear! We have a case of the dreaded label switching. In these mixture models, the parameters are only identified up to a permutation of the indices. That is, we could shuffle the indices of the parameters and the likelihood would be unchanged. Think about it this way: we said that state 1 (μ=3, σ=2) had probability .5. But we could equally call it "state 2" and it wouldn't affect the model at all, so long as we slot the state 2 parameters into index 1. Label switching happens when the labels change while the model is running (or differ across chains).
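We can verify this invariance numerically. The sketch below (plain Python, with an illustrative `mixture_lpdf` helper; the parameter values are the ones from the simulation above) evaluates the log likelihood contribution of a few points under the original labelling and under a relabelled copy of the same parameters:

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of a normal with location mu and scale sigma, evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_lpdf(y, theta, mu, sigma):
    """Log likelihood contribution of one observation under the mixture."""
    return math.log(sum(t * normal_pdf(y, m, s)
                        for t, m, s in zip(theta, mu, sigma)))

theta, mu, sigma = [0.5, 0.2, 0.3], [3.0, 6.0, 9.0], [2.0, 4.0, 3.0]

# Relabel the states: the "first" component is now called the second, etc.
perm = [1, 2, 0]
theta_p = [theta[k] for k in perm]
mu_p = [mu[k] for k in perm]
sigma_p = [sigma[k] for k in perm]

# The likelihood cannot tell the two labellings apart.
for y in [0.0, 7.0, 15.0]:
    assert math.isclose(mixture_lpdf(y, theta, mu, sigma),
                        mixture_lpdf(y, theta_p, mu_p, sigma_p))
```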
So what do we do? There are a few tricks. One is to declare one of the parameter vectors as an ordered vector, so long as there is theoretical support for doing so. In Stan we can do this by declaring the locations as `ordered[3] mu;` rather than `vector[3] mu;` in the parameters block.
This estimates far more cleanly: the parameter estimates now recover the generative process, and the traceplots look much healthier.