Monday, March 12, 2018

1000 Labels and 4500 observations. Have you tried modeling your (30) features?

TL;DR

This post describes a generative model for classification when then number of potential classes is very large, both relative to the number of observations and the number of features. I set up some fake data with 1000 labels, 4500 observations and 30 features. On 500 hold-out cases, the model generates predictions with around 43.5% accuracy. By contrast, a stock random forest has 31% accuracy, and an evening of fart-arsing around in Keras struggles to beat that.
The reason for the better classification is because we explicitly model the features as emerging from the label, in such a way that the model can be inverted and we can recover labels from the features. This approach makes good use of your limited data. 

Some background

Last week I was at Sawtooth Conference in Orlando where I finally met the wonderful Elea McDonnell Feit. Elea had an interesting problem. Occasionally, we have data that look something like this:
User idActivity 1Activity 2Activity 3Activity 4Activity 5
4900101
27710101
NA11100
NA10001
65501010
That is, we have activity data, but only sometimes observe the User IDs, but not always. For instance, you might run news website, and users sometimes log onto your site in incognito mode to avoid the paywall. While on your site, they engage in “activities”, which would be any binary variables that describe user behaviour, such as their operating system choice, browser, parts of the site they visit etc. Given a (potentially incomplete) observation of user behaviour, we want to generate a probabilistic prediction of their User ID.

A generative model

The model is something like the following:
We have users indexed by  engaging in activities . The probability that user  engages in activity  during a visit is . We observe  site visits, with 

set.seed(47)

N <- 5000
I <- 1000
A <- 30

user <- rep(NA, I)
activities <- matrix(NA, N, A)

# Draw some generative values of alpha
alpha <- matrix(rnorm(I*A, 0, 2), I, A)

# Draw some 0-1 activities with probabilities given by logit^-1(alpha)
for(n in 1:N) {
  # Draw the user
  user[n] <- sample(1:I, 1)
  # Given her alphas, draw an activity profile
  activities[n,] <- as.numeric(rbinom(A, 1, arm::invlogit(alpha[user[n],])))
}

# Now let's make some test data
user_2 <- user
user_2[(N-500):N] <- -1 # We use -1 in place of NA so that Stan doesn't mind

This gives us data that look like so (the crossover between labelled and unlabelled data shown):
User IDV1V2V3V4V5V6
809000101
994111110
909011101
-1010110
-1010001
-1101001

Estimating with a Random Forest

If you’re told to generate predictions for data that look like this, it’d be pretty natural to go to a workhorse model from machine learning, like a Random Forest. Let’s see how this goes
library(randomForest)
# Fit the model

if("model_fits.RData" %in% dir()) {
  load("model_fits.RData")
} else {
  rf_fit <- randomForest(factor(user_2[user_2>0]) ~ ., data = as.data.frame(activities[user_2>0,]))
}


# Make some predictions for the hold-out set
predictions <- data.frame(predictions = predict(rf_fit, newdata = as.data.frame(activities)),
                  actuals = user, 
                  i2 = user_2) %>% 
  filter(i2<0)

# What proportion of predictions were correct?
acc <- mean(predictions$predictions== predictions$actuals)
So there we go: predictive accuracy of 31.34% on hold-out data–pretty good! Especially given the huge number of potential labels, small number of observations, and small number of features, I’m extremely impressed. Guessing at random will give us a .1% accuracy.

Can we do better?

A generative classifier

Another way of classifying observations is to build a generative model of our activities, not the labels. Under this paradigm, each user  has their own set of parameters . If we apply the inverse Logit function, element-wise, to this vector we get the probabilities of each action for person  (the inverse logit function simply converts each alpha into a probability). Our assumption is that a vector of actions during some visit  (we assume that each user might visit multiple times),  is distributed Bernoulli with these probabilities

Once we make this modeling assumption, we can do three things:
  • For a given set of parameters, compute the likelihood of a labelled observation 
  • For an unlabelled observation, compute what its likelihood contribution would have been had it come from the model associated with every other users’ parameters. From this we can calculate the probability of it being an observation from that user.
  • Calculate the log likelihood of the model and estimate the parameters by Bayesian methods or, as I do below, using Penalized likelihood.

1. The likelhood of a labelled observation

Let’s say we have a labelled observation , that is, one for which we know the user. There are an associated set of parameters  with this user. With the model above, we can calculate the likelihood contribution of this observation.

Where  is the ’th element of , and is either 0 or 1. Taking the log gives us the log likelihood contribution of the observation.

2. Which user does an unlabelled observation belong to?

In order to construct the log likelihood of the data, we must work out the likelihood contribution of the labelled and unlabelled observations.
The likelihood contribution of an unlabelled observation is the weighted average of likelihood contributions under the proposed models, where the weights are the probability that the observation came from those models. But where do we get this probability?
Recall above that the likelihood of an observation belonging to some unlabelled person  given a set of parameters belonging to  is


Well the probability that it belongs to person  is simply the relative likelihood


This might look familiar! It’s the softmax of the possible log likelihoods of across all possible s.


So all we need to do is for every unlabelled observation, calculate the log likelihood contribution across all possible s, and take the softmax. This is precisely what happens in the transformed parameters block in the code below.

3. Calculating the full log likelihood

The full log likelihood is simply the sum of the log likelihood contributions of each observations. We know how to do this already for the labelled observations, but how about the unlabelled ones?
The likelihood (not log likelhood) contributions of an unlabelled observation is the weighted average likelihood for each possible users’ model, where the weights are calculated as above. That is


Therefore the log likelhood is simply the sum of this


This is precisely the situation in which we want to wheel out the log_sum_exp() function, which is defined as
log_sum_exp(x) = log(exp(x[1]) + exp(x[2]) + ...)
Except here, the x[1] is our log likelihood for  and so on. This is implemented in Stan below.

Some Stan code

Below I implement the estimation of the model described, in Stan. 

// saved as generative_classifier.stan
data {
  int N; // number of observations
  int I; // number of users
  int A; // number of activities
  
  int user[N]; // user indicator for every observation; -1 if user is missing
  
  matrix[N, A] activities;
}
parameters {
  matrix[I, A] alpha;
}
transformed parameters {
  matrix[N, I] user_probabilities = rep_matrix(0.0, N, I);
  
  for(n in 1:N) {
    // the user is known
    if(user[n] > 0) {
      user_probabilities[n, user[n]] = 1.0;
    } else {
      // the user is unknown
      vector[I] tmp;
      // loop through users
      for(i in 1:I) {
        tmp[i] = log(activities[n] * inv_logit(alpha[i]') + (1.0 - activities[n]) * (1.0 - inv_logit(alpha[i]')));
      }
      user_probabilities[n] = softmax(tmp)';
    }
  }
}
model {
  // Give prior to alpha
  to_vector(alpha) ~ normal(0, 1);
  
  // calculate the log likelihood
  for(n in 1:N) {
    if(user[n] > 0) {
      // log likelihood in the case that we observe the user
      target += activities[n] * log(inv_logit(alpha[user[n]]')) + 
                (1.0 - activities[n]) * log(1.0 - inv_logit(alpha[user[n]]'));
    } else {
      // log likelihood in the case we do not observe the user
      target += log_sum_exp(log(user_probabilities[n]') + // probability of being each user
                            (activities[n] * log(inv_logit(alpha')) + // plus the log likelihood in each state
                            (1.0 - activities[n]) * log((1.0 - inv_logit(alpha'))))');
    } 
  }
}

Running it in R

Finally we compile and fit it in R. On my old Macbook pro, it fits in a few minutes for the example, which was far less than my abortive attempts with xgboost and far, far less than my evening spent trying to get a neural network to do anything sensible at all.

Note that I'm using Penalized Likelihood via optimizing() function, rather than sampling(). I'd normally use sampling(), which runs MCMC. Yet for such high-dimensions (30k parameters and 5M probability predictions), storing a thousand MCMC draws is just not practical. I also checked on some smaller datasets and found that MCMC, Variational Bayes and Penalized Likelihood all return very similar estimates. 
library(rstan)
compiled_model <- stan_model("generative_classifier.stan")

# Let's put some missing values (encoded as -1) in the individual/type index

# Create a list of data for Stan
data_list <- list(N = N, 
                  I = I, 
                  A = A, 
                  user = user_2,
                  activities = activities)

# Fit the model

if(!"model_fits.RData" %in% dir()) {
  optim_est <- optimizing(compiled_model, data_list)
  save(rf_fit, optim_est, file = "model_fits.RData")
}

How did we go?

Let’s get the implied probablities and see how often the highest probability went to the true user.
# Extract the fitted probabilities
implied_probs <- optim_est$par[grepl("user_probabilities", names(optim_est$par))]
implied_probs <- matrix(implied_probs, N, I) %>% as_data_frame()
implied_probs <- implied_probs[user_2<0,]

# What proportion were predicted correctly? 
acc_2 <- mean(apply(implied_probs, 1, which.max) == user[user_2<0])

This fairly simple model with a mere 30k parameters manages far better performance than the random forest, classifying 43.51% of observations correctly. That’s pretty phenomenal, given the problem.

Extensions

The point of this post is that modeling your features can help you massively improve your classification models when you don’t have a huge amount of data.
Of course, we don’t only have set such a problem up with binary activities. You could equally have continuous variables, or ordinal variables, categorical variables etc. The likelihood would be a bit different, but the concept the same.
I’d be especially keen to hear from my friends in machine learning about how they might tackle the same problem.
Jim