#
*Jim Savage*

####
*30 November 2017*

It’s been a while since I’ve done an introductory post. As defining likelihoods for our data is a necessary part of being a Bayesian, I thought it’d be worth stepping through likelihoods (and then Bayesian inference) very slowly. I encourage you to run the R code here to help you get a grasp of the content.

### What is a likelihood?

Let’s say we have a very simple linear model of with a single covariate . , and are the model’s parameters.

with . This is the same as saying that

This is the model we are saying

*generates*the data. It has three unknowns, , and . If we knew what these values were for sure, we could make probabilistic statements about what values might take if we knew the value of .
But that’s not how data analysis normally works. Normally we’re given values of and (or many s and many s) and have to infer the relationship between them, often with the help of some model. Our task is to estimate the values of the unknowns—in this case, , and .

So how do we do this by maximum likelihood? One thing that computers are really good at is choosing values of some unknowns in order to optimize the value of some function. So, what we want is some function—a way of scoring various combinations of , and so that the score is high when the model describes the data well, and low when it does not. One such way of creating a score is to evaluate the

*likelihood*of the data given the parameters (and the proposed generative model).
Let’s say we propose a set of parameters, , and and we have an observation . Given the parameters and we know that the outcomes should be distributed as

which looks like this:

Now we ask: what was the density at the

*actual*outcome ?
In

`R`

, we could go ahead and write a function that returns the density for a given observation. Because we’ll be optimizing this, we want to give it the unknowns as a single vector.```
density_1 <- function(theta, # the unknown parameters
x, y) {
# This function returns the density of a normal with mean theta[1] + theta[2]*x
# and standard deviation exp(theta[3])
# evaluated at y.
dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))
}
```

Now let’s evaluate the density at for the given likelihood values. Note that we’ve included the scale (which must be positive) as , as it must be positive. If we know , we can get back to by taking its log.

```
density_1(theta = c(2, .5, log(1)),
x = 1, y = 1)
```

`## [1] 0.1295176`

The value of the density at an observed outcome is the

*likelihood contribution*of a single datapoint. That’s precisely what we’ve just calculated above. You’ll be able to see that this likelihood contribution is maximized (at infinity, so what and .```
# log(0) -> -Inf
density_1(c(0, 1, -Inf) , 1, 1)
```

`## [1] Inf`

Er Jim, we have more than one observation

We have one observation, three unknowns—of course this is an unidentified model. In reality we have many more than one observation. So how do we use this principle of likelihood to get estimates of the unknowns?

If we assume (conditional) independence of the draws–that is, the value of one observation’s has no influence on another’s , the sample likelihood of your data vector is the

*product*of all the individual likelihoods .
You can probably see the problem here—for reasonably large spreads of the data, the value of the density for a datapoint is typically less than 1, and the product of many such datapoints will be an extremely small number. Too small, even, for a computer to deal well with. This might give us numerical issues. One way around this is to take the log of the likelihood—the Log Likelihood. Because the log of numbers in (0, 1) are negative numbers large in absolute value, the sum of these numbers will also have a large absolute value, so we’ll not get numerical issues in computation.

### So what does the likelihood mean?

Log likelihood is a confusing concept to beginners as the values of the numbers aren’t easily interpretable. You might run the analysis and get a number of -3477 and ask “is that good? Bad? What does it mean?” These numbers are more meaningful when conducting model comparison, that is, we get the out of sample log likelihood for two different models. Then we can compare them. The usefulness of this approach is best illustrated with an example.

Imagine that we have an outlier—rather than as in the image above, it is -10. Under the proposed model, we’re essentially saying that the such an outcome is all but impossible. The probability of observing an observation that low or lower is 3.73^{-36}. The density of such a point would be 4.69^{-35}—a very small number. But the log of its density is large in absolute value: -79.04. Compare that with the log likelihood of the original : -2.04. An outlier penalizes the log likelihood far more than a point at the mode.

This idea helps us do good modeling: we want to give positive weight to outcomes that happen, and no weight to impossible outcomes.

Maximum likelihood

Maximum likelihood estimators are simple: if the (log) likelihood is a unidimensional score of how well the data fit the model for a (potentially large) number of parameters, then we can simply run an optimizer that attempts to maximize this score by varying the values of the parameters. Such an estimator is called the maximum likelihood estimator.

Let’s get some practice at writing this out in R. Going through this exercise will help you understand what’s going on under the hood in Stan.

Suppose we have a a thousand observations. is drawn from a standard normal, , and . , where has a zero-centered normal distribution with standard deviation (scale) 3.

```
alpha <- 1; beta <- 2; sigma <- 3
x <- rnorm(1000)
y <- rnorm(1000, alpha + beta * x, sigma)
```

Now that we have some fake data, let’s write out the log likelihood function, as before. Note that because the optimization function we use

*minimizes*the function we need to returnt the*negative*of the log likelihood (minimizing the negative is the same as maximizing the likelihood).```
negative_log_likelihood <- function(theta, # the unknown parameters
x, y) {
# This function returns the log likelihood of a normal with mean theta[1] + theta[2]*x
# and standard deviation exp(theta[3])
# evaluated at a vector y.
-sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
}
```

Now we optimize

```
estimates <- optim(par = c(1,1,1), negative_log_likelihood, x = x, y =y)
estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])
paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
```

`## [1] "estimate of alpha is 1.13 true value is 1"`

`paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)`

`## [1] "estimate of beta is 1.94 true value is 2"`

`paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)`

`## [1] "estimate of sigma is 2.98 true value is 3"`

Hey presto! We just estimated a model by maximizing the likelihood by varying the parameters of our model

*keeping the data fixed*. You should note that this method can be applied more generally in much larger, more interesting models.### Prior distributions

Prior distributions summarize our information about the values of parameters

*before*seeing the data. For the uninitiated, this is the scary bit of building a Bayesian model, often because of fears of being biased, or having a poor understanding of how exactly the parameter influences the model. Such fears are often revealed by the choice of extremely diffuse priors, for instance .
Don’t be afraid of using priors. You almost always have high quality information about the values parameters might take on. For example:

- Estimates from previous studies
- Some unknowns have sign restrictions (fixed costs or discount rates probably aren’t negative; price elasticities of demand probably aren’t positive)
- The knowledge that regularization can help prevent over-fitting
- The scale of effects are probably known. Going to college probably won’t increase your income 100000%.

Prior distributions should be such that they put positive probabilistic weight on possible outcomes, and no weight on impossible outcomes. In the example above, the standard deviation of the residuals, must be positive. And so its prior should not have probabilistic weight below 0. That might guide the choice to a distribution like a half-Normal, half-Cauchy, half-Student’s t, inverse Gamma, lognormal etc.

### Bayes rule

Bayes rule gives us a method for combining the information from our data (the likelihood) and our priors. It says that the (joint) probability density of our parameter vector is

where is the likelihood, and is the prior. We call the posterior. Because doesn’t depend on the vector of unknowns, , we often express the posterior up to a constant of proportionality

What can you take away from this? A few things:

- If the prior or likelihood is equal to zero for a given value of , so too will be the posterior
- If the prior is very peaked, the posterior well be drawn towards the prior
- If the likelihood is very peaked (as tends to happen when you have many observations per unknown), the posterior will be drawn towards the likelihood estimate.

Bayes’ rule also tells us that in order to obtain a posterior, we need to have a prior and a likelihood.

### Using penalized likelihood estimation

Just as we can estimate a model using maximum likelihood, we can also estimate the

*mode*of the posterior using*penalized likelihood*estimation. This is a really great way to start thinking about Bayesian estimation.
To estimate a model using penalized likelihood, we simply need to recognize that if

then

We do this in the function below. I’ve been verbose in my implementation to make it very clear what’s going on. The

`target`

is proportional to the accumulated log posterior density. To maximize the log posterior density, we minimize the negative target. We’ll choose priors```
negative_penalized_log_likelihood <- function(theta, # the unknown parameters
x, y) {
# This function returns a value proportional to the
# log posterior density of a normal with mean theta[1] + theta[2]*x
# and standard deviation exp(theta[3])
# evaluated at a vector y.
# Initialize the target
target <- 0
# Add the density of the priors to the accumulator
# Add the log density of the scale prior at the value (truncated normal)
target <- target + log(truncnorm::dtruncnorm(exp(theta[3]), a = 0))
# Add the log density of the intercept prior at the parameter value
target <- target + log(dnorm(theta[1]))
# Add the log density of the slope prior at the parameter value
target <- target + log(dnorm(theta[2]))
# Add the Log likelihood
target <- target + sum(log(dnorm(y, theta[1] + theta[2]*x, exp(theta[3]))))
# Return the negative target
-target
}
```

Now we optimize

```
estimates <- optim(par = c(1,1,1), negative_penalized_log_likelihood, x = x, y =y)
estimated_alpha <- estimates$par[1]
estimated_beta <- estimates$par[2]
estimated_sigma <- exp(estimates$par[3])
paste("estimate of alpha is", round(estimated_alpha, 2), "true value is", alpha)
```

`## [1] "estimate of alpha is 1.12 true value is 1"`

`paste("estimate of beta is", round(estimated_beta, 2), "true value is", beta)`

`## [1] "estimate of beta is 1.93 true value is 2"`

`paste("estimate of sigma is", round(estimated_sigma, 2), "true value is", sigma)`

`## [1] "estimate of sigma is 2.97 true value is 3"`

And there you go! You’ve estimated your very first (very simple) model using penalized likelihood.

Before you go on and get too enthusiastic about estimating everything with penalized likelihood, be warned: the modal estimator will do fine in fairly low dimensions, and with many fairly simple models. But as your models become more complex you’ll find that it does not do a good job. This is partly because the mode of a high-dimensional distribution can be a long way from its expected value (which is what you really care about). But also because you often want to generate predictions that take into account the uncertainty you have in your model’s unknowns when making predictions. For this, you’ll need to

*draw samples*from your posterior.
We’ll not get there just yet—that’s for another post. But just first, let’s take a quick look at what a posterior really looks like in practical terms.

#### What does a posterior look like?

In Bayesian inference we do not get point estimates for the unknowns in our models; we estimate their posterior distributions. Ideally, we’d want to be able to make posterior inference by asking questions like “what is the 5th percentile of the marginal posterior of ?”. This would require that we can analytically derive these statistics from the posterior. Sadly, most posteriors do not have a closed form, and so we need to approximate. The two common types of approximation are:

- To approximate the posterior with a joint density for which we can analytically evaluate quantiles, expected values etc. This is the approach in Variational Bayes and penalized likelihood.
- To obtain many independent draws from the posterior, and evaluate quantiles, expected values of those draws. This is the approach in MCMC and ABC.

In theory, our posterior is an abstract distribution; in practice (when using MCMC), it’s a matrix of data, where each column corresponds to a parameter, and each row a draw from . For instance, the first five rows of our posterior matrix for our linear model above might look like:

a | b | sigma |
---|---|---|

1.96 | 0.48 | 0.98 |

2.06 | 0.52 | 0.86 |

2.06 | 0.54 | 1.02 |

1.90 | 0.48 | 1.02 |

1.95 | 0.45 | 0.92 |

Each of these parameter combinations implies a different predictive distribution. Consequently, they also imply different predictive likelihoods for our out of sample datapoint . This is illustrated for 30 posterior replicates below.

The density of the data across posterior draws is known as the

*log posterior density*of a datapoint. Unlike the likelihood, we are uncertain of its value and need to perform inference. To illustrate this, let’s look at a histogram of the log posterior density in the above example (but with 500 replicates, not 30):
Note that the log posterior density can be evaluated

*for each datapoint*; if we have many datapoints for which we want to evaluate the LPD we simply need to evaluate it for each, and sum across all datapoints.### Conclusion

What I’ve hopefully demonstrated to you is what a likelihood is, and how it fits into Bayesian analysis. Without a likelihood, there is no posterior. The skill of deriving your likelihood and hand-coding up your own model is immensely useful, and maps very nicely to the techniques used in both advanced Bayesian statistics and mainstream machine learning.

## No comments:

## Post a Comment