Modern Statistical Workflow: What is a likelihood anyway?

It’s been a while since I’ve done an introductory post. As defining likelihoods for our data is a necessary part of being a Bayesian, I thought it’d be worth stepping through likelihoods (and then Bayesian inference) very slowly. I encourage you to run the R code here to help you get a grasp of the content.

What is a likelihood?

Let’s say we have a very simple linear model of

y

with a single covariate

x

α

β

and

σ

are the model’s parameters.

with

ϵ_{i} \sim N (0, σ)

. This is the same as saying that

y i \sim N (α + β x i, σ)

This is the model we are saying generates the data. It has three unknowns,

α

β

and

σ

. If we knew what these values were for sure, we could make probabilistic statements about what values

y_{i}

might take if we knew the value of

x_{i}

But that’s not how data analysis normally works. Normally we’re given values of

x

and

y

(or many

x

s and many

y

s) and have to infer the relationship between them, often with the help of some model. Our task is to estimate the values of the unknowns—in this case,

α

β

and

σ

So how do we do this by maximum likelihood? One thing that computers are really good at is choosing values of some unknowns in order to optimize the value of some function. So, what we want is some function—a way of scoring various combinations of

α

β

and

σ

so that the score is high when the model describes the data well, and low when it does not. One such way of creating a score is to evaluate the likelihood of the data given the parameters (and the proposed generative model).

Let’s say we propose a set of parameters,

α = 2

β = .5

and

σ = 1

and we have an observation

y_{i} = 1, x_{i} = 1

. Given the parameters and

x_{i}

we know that the outcomes

y

should be distributed as

y i \sim N (2 + .5 \times 1, 1) = N (2.5, 1)

which looks like this:

Now we ask: what was the density at the actual outcome

y_{i}

In R, we could go ahead and write a function that returns the density for a given observation. Because we’ll be optimizing this, we want to give it the unknowns as a single vector.

density_1 <- function(theta, # the unknown parameters
                      x, y) {
  # This function returns the density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3])
  # evaluated at y.
  dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))
}

Now let’s evaluate the density at

x = 1, y = 1

for the given likelihood values. Note that we’ve included the scale (which must be positive) as

σ = \exp (θ_{3})

, as it must be positive. If we know

θ_{3}

, we can get back to

σ

by taking its log.

density_1(theta = c(2, .5, log(1)), 
          x = 1, y = 1)

## [1] 0.1295176

The value of the density at an observed outcome is the likelihood contribution

f (y_{i} | x_{i}, α, β, σ)

of a single datapoint. That’s precisely what we’ve just calculated above. You’ll be able to see that this likelihood contribution is maximized (at infinity, so what

α + β \times 1 = 1

and

σ = 0

# log(0) -> -Inf
density_1(c(0, 1, -Inf) , 1, 1)

## [1] Inf

Er Jim, we have more than one observation

We have one observation, three unknowns—of course this is an unidentified model. In reality we have many more than one observation. So how do we use this principle of likelihood to get estimates of the unknowns?

If we assume (conditional) independence of the draws–that is, the value of one observation’s

ϵ_{i}

has no influence on another’s

ϵ_{j}

, the sample likelihood of your data vector

y

is the product of all the individual likelihoods

\prod_{i = 1}^{N} f (y_{i} | x_{i}, α, β, σ)

You can probably see the problem here—for reasonably large spreads of the data, the value of the density for a datapoint is typically less than 1, and the product of many such datapoints will be an extremely small number. Too small, even, for a computer to deal well with. This might give us numerical issues. One way around this is to take the log of the likelihood—the Log Likelihood. Because the log of numbers in (0, 1) are negative numbers large in absolute value, the sum of these numbers will also have a large absolute value, so we’ll not get numerical issues in computation.

log (\prod i = 1 N f (y i | x i, a, b, s)) = \sum i = 1 N log (f (y i | x i, a, b, s))

So what does the likelihood mean?

Log likelihood is a confusing concept to beginners as the values of the numbers aren’t easily interpretable. You might run the analysis and get a number of -3477 and ask “is that good? Bad? What does it mean?” These numbers are more meaningful when conducting model comparison, that is, we get the out of sample log likelihood for two different models. Then we can compare them. The usefulness of this approach is best illustrated with an example.

Imagine that we have an outlier—rather than

y_{j} = 1

as in the image above, it is -10. Under the proposed model, we’re essentially saying that the such an outcome is all but impossible. The probability of observing an observation that low or lower is 3.73^{-36}. The density of such a point would be 4.69^{-35}—a very small number. But the log of its density is large in absolute value: -79.04. Compare that with the log likelihood of the original

y_{j} = 1

: -2.04. An outlier penalizes the log likelihood far more than a point at the mode.

This idea helps us do good modeling: we want to give positive weight to outcomes that happen, and no weight to impossible outcomes.

Maximum likelihood

Maximum likelihood estimators are simple: if the (log) likelihood is a unidimensional score of how well the data fit the model for a (potentially large) number of parameters, then we can simply run an optimizer that attempts to maximize this score by varying the values of the parameters. Such an estimator is called the maximum likelihood estimator.

Let’s get some practice at writing this out in R. Going through this exercise will help you understand what’s going on under the hood in Stan.

Suppose we have a a thousand observations.

x

is drawn from a standard normal,

α = 1

β = 2

and

σ = 3

y = α + β x + ϵ

, where

ϵ

has a zero-centered normal distribution with standard deviation (scale) 3.

alpha <- 1; beta <- 2; sigma <- 3
x <- rnorm(1000)
y <- rnorm(1000, alpha + beta * x, sigma)

Now that we have some fake data, let’s write out the log likelihood function, as before. Note that because the optimization function we use minimizes the function we need to returnt the negative of the log likelihood (minimizing the negative is the same as maximizing the likelihood).

negative_log_likelihood <- function(theta, # the unknown parameters
                      x, y) {
  # This function returns the log likelihood of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3])
  # evaluated at a vector y.
  -sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
}

Now we optimize

estimates <- optim(par = c(1,1,1), negative_log_likelihood, x = x, y =y)

estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])

paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)

## [1] "estimate of alpha is 1.13 true value is 1"

paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)

## [1] "estimate of beta is 1.94 true value is 2"

paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)

## [1] "estimate of sigma is 2.98 true value is 3"

Hey presto! We just estimated a model by maximizing the likelihood by varying the parameters of our model keeping the data fixed. You should note that this method can be applied more generally in much larger, more interesting models.

Prior distributions

Prior distributions summarize our information about the values of parameters before seeing the data. For the uninitiated, this is the scary bit of building a Bayesian model, often because of fears of being biased, or having a poor understanding of how exactly the parameter influences the model. Such fears are often revealed by the choice of extremely diffuse priors, for instance

β \sim Normal (0, 1000)

Don’t be afraid of using priors. You almost always have high quality information about the values parameters might take on. For example:

Estimates from previous studies
Some unknowns have sign restrictions (fixed costs or discount rates probably aren’t negative; price elasticities of demand probably aren’t positive)
The knowledge that regularization can help prevent over-fitting
The scale of effects are probably known. Going to college probably won’t increase your income 100000%.

Prior distributions should be such that they put positive probabilistic weight on possible outcomes, and no weight on impossible outcomes. In the example above, the standard deviation of the residuals,

σ

must be positive. And so its prior should not have probabilistic weight below 0. That might guide the choice to a distribution like a half-Normal, half-Cauchy, half-Student’s t, inverse Gamma, lognormal etc.

Bayes rule

Bayes rule gives us a method for combining the information from our data (the likelihood) and our priors. It says that the (joint) probability density of our parameter vector

θ

p (θ | y) = p ( y | θ ) p ( θ ) p ( y )

where

p (y | θ)

is the likelihood, and

p (θ)

is the prior. We call

p (θ | y)

the posterior. Because

p (y)

doesn’t depend on the vector of unknowns,

θ

, we often express the posterior up to a constant of proportionality

p (θ | y) \propto p (y | θ) p (θ)

What can you take away from this? A few things:

If the prior or likelihood is equal to zero for a given value of $θ$ , so too will be the posterior
If the prior is very peaked, the posterior well be drawn towards the prior
If the likelihood is very peaked (as tends to happen when you have many observations per unknown), the posterior will be drawn towards the likelihood estimate.

Bayes’ rule also tells us that in order to obtain a posterior, we need to have a prior and a likelihood.

Using penalized likelihood estimation

Just as we can estimate a model using maximum likelihood, we can also estimate the mode of the posterior using penalized likelihood estimation. This is a really great way to start thinking about Bayesian estimation.

To estimate a model using penalized likelihood, we simply need to recognize that if

p (θ | y) \propto p (y | θ) p (θ)

then

log (p (θ | y)) \propto l o g (p (y | θ)) + log (p (θ))

That is, our log posterior density is proportional to the log likelihood plus the log prior density evaluated at a given value of

θ

. So we can choose

θ

to maximize our posterior by choosing a

θ

that maximizes the right hand side.

We do this in the function below. I’ve been verbose in my implementation to make it very clear what’s going on. The target is proportional to the accumulated log posterior density. To maximize the log posterior density, we minimize the negative target. We’ll choose priors

negative_penalized_log_likelihood <- function(theta, # the unknown parameters
                      x, y) {
  # This function returns a value proportional to the 
  # log posterior density of a normal with mean theta[1] + theta[2]*x
  # and standard deviation exp(theta[3])
  # evaluated at a vector y.
  
  # Initialize the target
  target <- 0
  
  # Add the density of the priors to the accumulator
  # Add the log density of the scale prior at the value (truncated normal)
  target <- target + log(truncnorm::dtruncnorm(exp(theta[3]), a = 0))
  # Add the log density of the intercept prior at the parameter value
  target <- target + log(dnorm(theta[1]))
  # Add the log density of the slope prior at the parameter value
  target <- target + log(dnorm(theta[2]))
  # Add the Log likelihood
  target <- target + sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
  
  # Return the negative target
  -target
}

Now we optimize

estimates <- optim(par = c(1,1,1), negative_penalized_log_likelihood, x = x, y =y)

estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])

paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)

## [1] "estimate of alpha is 1.12 true value is 1"

paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)

## [1] "estimate of beta is 1.93 true value is 2"

paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)

## [1] "estimate of sigma is 2.97 true value is 3"

And there you go! You’ve estimated your very first (very simple) model using penalized likelihood.

Before you go on and get too enthusiastic about estimating everything with penalized likelihood, be warned: the modal estimator will do fine in fairly low dimensions, and with many fairly simple models. But as your models become more complex you’ll find that it does not do a good job. This is partly because the mode of a high-dimensional distribution can be a long way from its expected value (which is what you really care about). But also because you often want to generate predictions that take into account the uncertainty you have in your model’s unknowns when making predictions. For this, you’ll need to draw samples from your posterior.

We’ll not get there just yet—that’s for another post. But just first, let’s take a quick look at what a posterior really looks like in practical terms.

What does a posterior look like?

In Bayesian inference we do not get point estimates for the unknowns in our models; we estimate their posterior distributions. Ideally, we’d want to be able to make posterior inference by asking questions like “what is the 5th percentile of the marginal posterior of

θ_{1}

?”. This would require that we can analytically derive these statistics from the posterior. Sadly, most posteriors do not have a closed form, and so we need to approximate. The two common types of approximation are:

To approximate the posterior with a joint density for which we can analytically evaluate quantiles, expected values etc. This is the approach in Variational Bayes and penalized likelihood.
To obtain many independent draws from the posterior, and evaluate quantiles, expected values of those draws. This is the approach in MCMC and ABC.

In theory, our posterior

p (θ | y)

is an abstract distribution; in practice (when using MCMC), it’s a matrix of data, where each column corresponds to a parameter, and each row a draw from

p (θ | y)

. For instance, the first five rows of our posterior matrix for our linear model above might look like:

a	b	sigma
1.96	0.48	0.98
2.06	0.52	0.86
2.06	0.54	1.02
1.90	0.48	1.02
1.95	0.45	0.92

Each of these parameter combinations implies a different predictive distribution. Consequently, they also imply different predictive likelihoods for our out of sample datapoint

y_{j} = 1, x_{j} = 1

. This is illustrated for 30 posterior replicates below.

The density of the data across posterior draws is known as the log posterior density of a datapoint. Unlike the likelihood, we are uncertain of its value and need to perform inference. To illustrate this, let’s look at a histogram of the log posterior density in the above example (but with 500 replicates, not 30):

Note that the log posterior density can be evaluated for each datapoint; if we have many datapoints for which we want to evaluate the LPD we simply need to evaluate it for each, and sum across all datapoints.

Conclusion

What I’ve hopefully demonstrated to you is what a likelihood is, and how it fits into Bayesian analysis. Without a likelihood, there is no posterior. The skill of deriving your likelihood and hand-coding up your own model is immensely useful, and maps very nicely to the techniques used in both advanced Bayesian statistics and mainstream machine learning.

Modern Statistical Workflow

Thursday, November 30, 2017

What is a likelihood anyway?

Jim Savage

30 November 2017