Saturday, January 27, 2018

Some good introductory machine learning resources in R

I didn't want to clog up a Twitter thread with a bunch of machine learning blogs/books/vignettes/software, but also thought an email to Scott wouldn't be useful to anyone else. So here are a few relatively-accessible resources that someone with a bit of math should be able to get through with ease. 

(Regularized) generalized linear models

This is an excellent worked vignette for regularized (generalized) linear models using the fantastic glmnet package in R: 

What you'll find is that for prediction, regularized glms with some feature engineering (interactions, bucketing, splines, combinations of all three) will typically give you similar predictive performance to random forests while maintaining interpretability and the possibility of estimating uncertainty (see below). That's why they're so popular. 
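To make the idea of regularization concrete, here is a hand-rolled ridge penalty in base R rather than the glmnet API itself; the data, the penalty value, and the closed-form solution are illustrative choices of mine:

```r
# Ridge regression "by hand": minimize ||y - X b||^2 + lambda * ||b||^2,
# which has the closed form b = solve(X'X + lambda * I, X'y).
set.seed(42)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
true_beta <- c(3, -2, rep(0, p - 2))  # only the first two features matter
y <- X %*% true_beta + rnorm(n)

ridge <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}

b_ols   <- ridge(X, y, 0)   # lambda = 0 recovers ordinary least squares
b_ridge <- ridge(X, y, 10)  # a positive lambda shrinks coefficients toward zero

# The penalty strictly shrinks the overall size of the coefficient vector
c(ols = sum(b_ols^2), ridge = sum(b_ridge^2))
```

glmnet does far more than this (lasso and elastic-net penalties, a whole regularization path, cross-validation), but the penalized objective above is the core idea.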

When you have high-dimensional categorical predictors or natural groupings, it often doesn't make sense to one-hot encode them (i.e. fit fixed effects) in a regularized glm. Doing so will result in the same degree of regularization across grouping variables, which might be undesirable. In such a case you can often see huge improvements by simply using varying intercepts (and even varying slopes) in a Bayesian random effects model. The nice thing here is that because it's Bayesian, you get uncertainty for free. Well, not free--you pay for it in the extra coal and time you'll burn fitting your model. But they're really pretty great. rstanarm implements these very nicely. 

In the above two methods, if you want to discover non-linearities by yourself, you have to cook your own non-linear features. But there are methods that do this quite well, while retaining the interpretability of linear models. The fantastic mgcv and rstanarm packages will fit Generalised Additive Models using maximum likelihood and MCMC-based techniques respectively. Here is a fun introduction:


And here is a full vignette on implementation using various GAM packages. 
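As a minimal sketch (my own toy data, not from either vignette), mgcv will happily discover a sine-shaped relationship without any hand-made features:

```r
# Fit a GAM with mgcv: the smooth term s(x) learns the non-linearity for us.
library(mgcv)  # ships with standard R installations
set.seed(1)
x <- runif(500, 0, 2 * pi)
y <- sin(x) + rnorm(500, sd = 0.3)
fit <- gam(y ~ s(x))
# The fitted smooth should track the true sine curve closely
cor(fitted(fit), sin(x))
```

Compare that one-liner with how much interaction/spline cookery the same fit would take in glmnet.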

Tree-based methods

The obvious alternatives to regularized glms are tree-based methods and neural networks. A lot of industry folks, especially those who started life using proprietary packages, use SVMs too. Pedants who enjoy O(n^3) operations seem to get a weird kick out of Gaussian Processes. The point of all these methods is the same: to relax (or really, to automatically discover) non-linear relationships between features and outcomes. Tree-based methods and neural networks will also do well at discovering interactions. Neural networks go a step further and uncover representations of your data which might be useful in themselves. 

To get a good understanding of tree-based methods, it makes sense to start at the beginning--with a simple classification and regression tree. I found this introduction pretty clear:
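If you want to poke at one yourself, the rpart package (bundled with R) fits a CART in one line; the built-in iris data is just a convenient example:

```r
# Grow a classification tree on the built-in iris data.
library(rpart)
fit <- rpart(Species ~ ., data = iris)
print(fit)  # the fitted splits read as nested if/else rules
# In-sample accuracy of the tree's class predictions
mean(predict(fit, type = "class") == iris$Species)
```

Printing the fitted object is the best part: the whole model is a handful of human-readable splitting rules.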

Once you understand CART, then Random Forests are probably the next step. The original Breiman piece is as good a place to start as any:

Next you should learn about tree-based additive models. These come in many varieties, but something close to the current state-of-the-art is implemented using xgboost. These techniques combined with smart feature engineering will work extremely well for a wide range of predictive problems. I incorporate them into my work to serve as a baseline that simpler models (for which we can get more sensible notions of uncertainty) should be able to get close to with enough work.
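To see the additive idea without xgboost's machinery, here is a toy gradient-boosting loop using rpart stumps; it is a sketch of the concept (my own simulated data and settings), not of how xgboost is actually implemented:

```r
# Toy boosting for squared-error loss: repeatedly fit a shallow tree to the
# residuals of the current ensemble, and add it in with a small learning rate.
library(rpart)
set.seed(2)
x <- runif(300, -3, 3)
d <- data.frame(x = x, y = x^2 + rnorm(300, sd = 0.5))  # non-linear target

pred <- rep(mean(d$y), nrow(d))  # initialize the ensemble at the mean
eta  <- 0.1                      # learning rate
for (m in 1:100) {
  d$resid <- d$y - pred          # residuals of the current ensemble
  stump <- rpart(resid ~ x, data = d,
                 control = rpart.control(maxdepth = 2))
  pred <- pred + eta * predict(stump, d)
}
# The ensemble's mean squared error versus the constant-mean baseline
c(boosted = mean((d$y - pred)^2), baseline = var(d$y))
```

xgboost adds regularization, clever split-finding, and serious engineering on top of this loop, but the fit-to-the-residuals idea is the heart of it.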

Net-based methods

Neural networks are of course all the rage, yet it's helpful to remember that they're really just tools for high-dimensional functional approximation. I found them hard to get into coming from an econometrics background (where notions like "maybe we should have more observations than unknowns in the model" are fairly common). But there are really just a few concepts to understand in order to get something working. 

I found David MacKay's chapters on them to be extremely easy to grasp. His whole, brilliant book is available for free here, with the relevant chapters starting at page 467:

Given you have some understanding now of what a neural network is and how it's fit, you can get down to fitting some. There are a few great high-level approaches, like Keras and others, which make it extremely easy to dive in:


Note that these two approaches are great for fairly simple prediction tasks. If you want to make any real investment in deep learning for image/voice/NLP then you will find yourself working at a lower level (the analogy for statisticians would be going from rstanarm/brms to Stan proper), like Torch or TensorFlow. At this point you would probably be wise in asking yourself what you're doing in R--almost the entire AI community uses Python.

Even so, there is a reasonable API for TensorFlow available within R. I've not done a huge amount of playing outside of the tutorials, which seem well written.


If you know of any other great resources for someone--especially an economist--wanting to build their machine-learning chops, please drop them in the comments! 

Sunday, December 3, 2017

My rules for giving technical help

I spend a reasonable amount of time fielding questions from friends and acquaintances on how to solve modeling problems in Stan. This can be hugely valuable and enjoyable for both me and the other person. I can't state this enough. Yet it can also be exasperating, especially if the person asking for help hasn't done the necessary ground-work to understand the complex model they want to fit.

So here are my rules on giving help:

  • No syntax questions. Ask these on the Stan users forum instead: other people have probably asked the same question already, and future users will benefit from yours. Also, the Stan manual is incredible. 
  • If you haven't started with a simple model first, I won't help. I need to be able to see a fit from a simple model, and all iterations you've made to build up to the complex model you're having trouble with.
  • Similarly, if you haven't simulated your model with fake data (I've written one really easy way to do this here) and tried to estimate the "known unknowns", I won't help. Normally doing this teaches everything you need for most problems. If you've done this and are still having problems (and you've been successful in doing this with simpler iterations of the model in the previous step), then you should ask me stuff! 
If you have done these three things, you'll probably find me to be incredibly happy to help you fit interesting models.

[Edit] The exception is when you're at the start of a modeling project with a "how would one model this?" sort of question. They are enjoyable. 


Thursday, November 30, 2017

What is a likelihood anyway?

It’s been a while since I’ve done an introductory post. As defining likelihoods for our data is a necessary part of being a Bayesian, I thought it’d be worth stepping through likelihoods (and then Bayesian inference) very slowly. I encourage you to run the R code here to help you get a grasp of the content.

What is a likelihood?

Let’s say we have a very simple linear model of $y$ with a single covariate $x$, where $\alpha$, $\beta$ and $\sigma$ are the model’s parameters.

$$y = \alpha + \beta x + \epsilon$$

with $\epsilon \sim \text{normal}(0, \sigma)$. This is the same as saying that

$$y \sim \text{normal}(\alpha + \beta x, \sigma)$$

This is the model we are saying generates the data. It has three unknowns: $\alpha$, $\beta$ and $\sigma$. If we knew what these values were for sure, we could make probabilistic statements about what values $y$ might take if we knew the value of $x$.
But that’s not how data analysis normally works. Normally we’re given values of $y$ and $x$ (or many $y$s and $x$s) and have to infer the relationship between them, often with the help of some model. Our task is to estimate the values of the unknowns—in this case, $\alpha$, $\beta$ and $\sigma$.
So how do we do this by maximum likelihood? One thing that computers are really good at is choosing values of some unknowns in order to optimize the value of some function. So what we want is a way of scoring various combinations of $\alpha$, $\beta$ and $\sigma$ so that the score is high when the model describes the data well, and low when it does not. One such score is the likelihood of the data given the parameters (and the proposed generative model).
Let’s say we propose a set of parameters $\alpha = 2$, $\beta = 0.5$ and $\sigma = 1$, and we have an observation $(x, y) = (1, 1)$. Given the parameters and $x$, we know that the outcome $y$ should be distributed as

$$y \sim \text{normal}(2 + 0.5 \times 1,\ 1) = \text{normal}(2.5, 1)$$

which looks like this:

Now we ask: what was the density at the actual outcome $y = 1$?

In R, we could go ahead and write a function that returns the density for a given observation. Because we’ll be optimizing this, we want to give it the unknowns as a single vector.
density_1 <- function(theta, # the unknown parameters
                      x, y) {
  # This function returns the density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3]),
  # evaluated at y.
  dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))
}
Now let’s evaluate the density at $(x, y) = (1, 1)$ for the given parameter values. Note that we’ve parameterized the scale (which must be positive) as $\sigma = \exp(\theta_3)$, which is positive for any real $\theta_3$. If we know $\sigma$, we can get back to $\theta_3$ by taking its log.
density_1(theta = c(2, .5, log(1)), 
          x = 1, y = 1)
## [1] 0.1295176
The value of the density at an observed outcome is the likelihood contribution of a single datapoint. That’s precisely what we’ve just calculated above. You’ll be able to see that this likelihood contribution is maximized (at infinity, no less) when $\alpha + \beta x = y$ and $\sigma = 0$:
# log(0) -> -Inf
density_1(c(0, 1, -Inf) , 1, 1)
## [1] Inf

Er Jim, we have more than one observation
We have one observation, three unknowns—of course this is an unidentified model. In reality we have many more than one observation. So how do we use this principle of likelihood to get estimates of the unknowns?
If we assume (conditional) independence of the draws--that is, that the value of one observation’s $\epsilon$ has no influence on another’s--then the sample likelihood of your data vector $y$ is the product of all the individual likelihood contributions: $\mathcal{L}(y) = \prod_i \mathcal{L}(y_i)$.
You can probably see the problem here—for reasonably large spreads of the data, the value of the density at a datapoint is typically less than 1, and the product of many such values will be an extremely small number: too small, even, for a computer to deal well with. One way around this is to take the log of the likelihood—the log likelihood. Because the log of a number in (0, 1) is negative, the product of many small densities becomes a sum of moderately sized negative numbers, which floating-point arithmetic handles without trouble.
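You can see the underflow directly in R; the data here are simulated purely for the demonstration:

```r
# The product of 2000 densities underflows double precision to exactly zero,
# while the sum of log densities stays an ordinary finite number.
set.seed(3)
y <- rnorm(2000, mean = 0, sd = 5)
prod(dnorm(y, 0, 5))              # underflows to 0
sum(dnorm(y, 0, 5, log = TRUE))   # a large negative, but finite, number
```

Note dnorm's `log = TRUE` argument: it computes the log density directly, which is both faster and more accurate than `log(dnorm(...))`.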

So what does the likelihood mean?

Log likelihood is a confusing concept to beginners, as the values of the numbers aren’t easily interpretable. You might run the analysis, get a number like -3477, and ask “is that good? Bad? What does it mean?” These numbers are more meaningful when conducting model comparison: we get the out-of-sample log likelihood for two different models, and then we can compare them. The usefulness of this approach is best illustrated with an example.
Imagine that we have an outlier: rather than $y = 1$ as in the image above, it is $y = -10$. Under the proposed model, we’re essentially saying that such an outcome is all but impossible. The probability of observing an outcome that low or lower is $3.73 \times 10^{-36}$. The density at such a point would be $4.69 \times 10^{-35}$—a very small number. But the log of its density is large in absolute value: $-79.04$. Compare that with the log likelihood of the original $y = 1$: $-2.04$. An outlier penalizes the log likelihood far more than a point at the mode.
This idea helps us do good modeling: we want to give positive weight to outcomes that happen, and no weight to impossible outcomes.
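These numbers are easy to reproduce in R under the proposed model $y \sim \text{normal}(2.5, 1)$:

```r
# Tail probability, density, and log density at the outlier y = -10,
# under the proposed model y ~ normal(2.5, 1).
pnorm(-10, mean = 2.5, sd = 1)    # P(y <= -10): about 3.73e-36
dnorm(-10, mean = 2.5, sd = 1)    # density at the outlier: about 4.69e-35
dnorm(-10, 2.5, 1, log = TRUE)    # log density: about -79.04
dnorm(1, 2.5, 1, log = TRUE)      # log density at y = 1: about -2.04
```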
Maximum likelihood
Maximum likelihood estimators are simple: if the (log) likelihood is a unidimensional score of how well the data fit the model for a (potentially large) number of parameters, then we can simply run an optimizer that attempts to maximize this score by varying the values of the parameters. Such an estimator is called the maximum likelihood estimator.
Let’s get some practice at writing this out in R. Going through this exercise will help you understand what’s going on under the hood in Stan.
Suppose we have a thousand observations. $x$ is drawn from a standard normal, and $y = \alpha + \beta x + \epsilon$ with $\alpha = 1$ and $\beta = 2$, where $\epsilon$ has a zero-centered normal distribution with standard deviation (scale) $\sigma = 3$.
alpha <- 1; beta <- 2; sigma <- 3
x <- rnorm(1000)
y <- rnorm(1000, alpha + beta * x, sigma)
Now that we have some fake data, let’s write out the log likelihood function, as before. Note that because the optimization function we use minimizes its objective, we need to return the negative of the log likelihood (minimizing the negative log likelihood is the same as maximizing the likelihood).
negative_log_likelihood <- function(theta, # the unknown parameters
                                    x, y) {
  # This function returns the negative log likelihood of a normal with
  # mean theta[1] + theta[2]*x and standard deviation exp(theta[3]),
  # evaluated at a vector y.
  -sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
}
Now we optimize
estimates <- optim(par = c(1, 1, 1), negative_log_likelihood, x = x, y = y)

estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])

paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
## [1] "estimate of alpha is 1.13 true value is 1"
paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)
## [1] "estimate of beta is 1.94 true value is 2"
paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)
## [1] "estimate of sigma is 2.98 true value is 3"
Hey presto! We just estimated a model by maximizing the likelihood: varying the parameters of our model while keeping the data fixed. You should note that this method can be applied more generally in much larger, more interesting models.

Prior distributions

Prior distributions summarize our information about the values of parameters before seeing the data. For the uninitiated, this is the scary bit of building a Bayesian model, often because of fears of being biased, or of having a poor understanding of how exactly the parameter influences the model. Such fears are often revealed by the choice of extremely diffuse priors, for instance a zero-centered normal prior with an enormous standard deviation.
Don’t be afraid of using priors. You almost always have high quality information about the values parameters might take on. For example:
  • Estimates from previous studies
  • Some unknowns have sign restrictions (fixed costs or discount rates probably aren’t negative; price elasticities of demand probably aren’t positive)
  • The knowledge that regularization can help prevent over-fitting
  • The scale of effects is probably known. Going to college probably won't increase your income by 100,000%.
Prior distributions should be such that they put positive probabilistic weight on possible outcomes, and no weight on impossible outcomes. In the example above, the standard deviation of the residuals, $\sigma$, must be positive, and so its prior should put no probabilistic weight below 0. That might guide the choice toward a distribution like a half-normal, half-Cauchy, half-Student’s t, inverse gamma, lognormal, etc.
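As a quick sanity check of that principle, draws from a half-normal (the absolute value of a normal) never fall below zero:

```r
# A half-normal prior puts all of its weight on non-negative values,
# which is exactly what a scale parameter like sigma requires.
set.seed(4)
sigma_draws <- abs(rnorm(1e4))
min(sigma_draws)   # never below zero
```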

Bayes rule

Bayes rule gives us a method for combining the information from our data (the likelihood) and our priors. It says that the (joint) probability density of our parameter vector $\theta$ given the data $y$ is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

where $p(y \mid \theta)$ is the likelihood and $p(\theta)$ is the prior. We call $p(\theta \mid y)$ the posterior. Because $p(y)$ doesn't depend on the vector of unknowns $\theta$, we often express the posterior up to a constant of proportionality:

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$
What can you take away from this? A few things:
  • If the prior or likelihood is equal to zero for a given value of $\theta$, so too will be the posterior at that value
  • If the prior is very peaked, the posterior will be drawn towards the prior
  • If the likelihood is very peaked (as tends to happen when you have many observations per unknown), the posterior will be drawn towards the likelihood estimate.
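For one model where the posterior is available in closed form—a normal likelihood with known scale and a normal prior on the mean—you can watch this tug-of-war directly; the numbers below are illustrative choices of mine:

```r
# Normal-normal conjugacy: the posterior mean of mu is a precision-weighted
# average of the prior mean and the sample mean.
posterior_mean <- function(n, ybar, sigma, mu0, tau0) {
  prior_prec <- 1 / tau0^2    # precision of the prior on mu
  data_prec  <- n / sigma^2   # precision contributed by the data
  (prior_prec * mu0 + data_prec * ybar) / (prior_prec + data_prec)
}
# Few observations: the posterior hugs the prior mean (0)
posterior_mean(n = 2,    ybar = 10, sigma = 5, mu0 = 0, tau0 = 1)
# Many observations: the posterior hugs the sample mean (10)
posterior_mean(n = 1000, ybar = 10, sigma = 5, mu0 = 0, tau0 = 1)
```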
Bayes’ rule also tells us that in order to obtain a posterior, we need to have a prior and a likelihood.

Using penalized likelihood estimation

Just as we can estimate a model using maximum likelihood, we can also estimate the mode of the posterior using penalized likelihood estimation. This is a really great way to start thinking about Bayesian estimation.
To estimate a model using penalized likelihood, we simply need to recognize that

$$\log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) + \text{constant}$$

That is, our log posterior density is, up to a constant, the log likelihood plus the log prior density evaluated at a given value of $\theta$. So we can maximize the posterior by choosing the $\theta$ that maximizes the right hand side.

We do this in the function below. I’ve been verbose in my implementation to make it very clear what’s going on. The target is proportional to the accumulated log posterior density. To maximize the log posterior density, we minimize the negative target. We’ll choose standard normal priors for the intercept and slope, and a normal prior truncated below at zero (a half-normal) for the scale.

negative_penalized_log_likelihood <- function(theta, # the unknown parameters
                                              x, y) {
  # This function returns a value proportional to the negative
  # log posterior density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3]),
  # evaluated at a vector y.
  # Initialize the target
  target <- 0
  # Add the log density of the scale prior at the parameter value (truncated normal)
  target <- target + log(truncnorm::dtruncnorm(exp(theta[3]), a = 0))
  # Add the log density of the intercept prior at the parameter value
  target <- target + log(dnorm(theta[1]))
  # Add the log density of the slope prior at the parameter value
  target <- target + log(dnorm(theta[2]))
  # Add the log likelihood
  target <- target + sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
  # Return the negative target
  -target
}
Now we optimize
estimates <- optim(par = c(1, 1, 1), negative_penalized_log_likelihood, x = x, y = y)

estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])

paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
## [1] "estimate of alpha is 1.12 true value is 1"
paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)
## [1] "estimate of beta is 1.93 true value is 2"
paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)
## [1] "estimate of sigma is 2.97 true value is 3"
And there you go! You’ve estimated your very first (very simple) model using penalized likelihood.
Before you go on and get too enthusiastic about estimating everything with penalized likelihood, be warned: the modal estimator will do fine in fairly low dimensions and with many fairly simple models. But as your models become more complex, you'll find that it does not do a good job. This is partly because the mode of a high-dimensional distribution can be a long way from its expected value (which is usually what you really care about), and partly because you often want predictions that take into account the uncertainty in your model's unknowns. For this, you'll need to draw samples from your posterior.
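Even in one dimension you can see the mode-versus-mean problem; for a lognormal the two sit far apart (the scale value 1.5 below is just an illustration):

```r
# For a lognormal(0, sigma), mode = exp(-sigma^2) and mean = exp(sigma^2 / 2).
sigma <- 1.5
mode_ln <- exp(-sigma^2)      # about 0.105
mean_ln <- exp(sigma^2 / 2)   # about 3.08
# A modal estimate would report 0.105 where the expected value is about 3.08
c(mode = mode_ln, mean = mean_ln)
```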
We’ll not get there just yet—that’s for another post. But just first, let’s take a quick look at what a posterior really looks like in practical terms.

What does a posterior look like?

In Bayesian inference we do not get point estimates for the unknowns in our models; we estimate their posterior distributions. Ideally, we’d want to be able to make posterior inference by asking questions like “what is the 5th percentile of the marginal posterior of ?”. This would require that we can analytically derive these statistics from the posterior. Sadly, most posteriors do not have a closed form, and so we need to approximate. The two common types of approximation are:
  1. To approximate the posterior with a joint density for which we can analytically evaluate quantiles, expected values etc. This is the approach in Variational Bayes and penalized likelihood.
  2. To obtain many independent draws from the posterior, and evaluate quantiles, expected values of those draws. This is the approach in MCMC and ABC.
In theory, our posterior $p(\theta \mid y)$ is an abstract distribution; in practice (when using MCMC), it’s a matrix of data, where each column corresponds to a parameter and each row is a draw from $p(\theta \mid y)$. For instance, the first five rows of our posterior matrix for the linear model above might look like this:
Each of these parameter combinations implies a different predictive distribution. Consequently, they also imply different predictive likelihoods for an out-of-sample datapoint. This is illustrated for 30 posterior replicates below. 

The density of the data across posterior draws is known as the log posterior density of a datapoint. Unlike the likelihood, we are uncertain of its value and need to perform inference. To illustrate this, let’s look at a histogram of the log posterior density in the above example (but with 500 replicates, not 30):

Note that the log posterior density can be evaluated for each datapoint; if we have many datapoints for which we want to evaluate the LPD we simply need to evaluate it for each, and sum across all datapoints.
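A sketch of that computation in R; the "posterior" draws of the mean below are simulated for illustration rather than coming from a real fit:

```r
# LPD of held-out datapoints from posterior draws: for each datapoint, average
# the density over draws (on the log scale, via log-mean-exp), then sum.
set.seed(5)
mu_draws <- rnorm(500, mean = 2.5, sd = 0.1)  # pretend posterior for the mean
y_new <- c(1, 2, 3)                           # held-out datapoints

log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))

lpd_each <- sapply(y_new, function(y)
  log_mean_exp(dnorm(y, mean = mu_draws, sd = 1, log = TRUE)))
sum(lpd_each)  # total LPD across the held-out data
```

The log-mean-exp trick is the same underflow defence as summing log likelihoods: averaging raw densities over thousands of draws would be numerically fragile.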


What I’ve hopefully demonstrated to you is what a likelihood is, and how it fits into Bayesian analysis. Without a likelihood, there is no posterior. The skill of deriving your likelihood and hand-coding up your own model is immensely useful, and maps very nicely to the techniques used in both advanced Bayesian statistics and mainstream machine learning.