Jim Savage
1 September 2016
One of the things I had difficulty with when learning statistics was getting an intuitive grasp of what a covariance matrix is. I mean, I knew it had to do with correlation and variance— when two variables have a big covariance, they tend to move together; when the covariance is negative, they move in opposite directions. But variances are so hard to interpret. And I was a bad student.
This post is a quick tutorial on covariance matrices, the multivariate normal distribution, the LKJ prior, and Box-Cox transformations.
Some data
First, let’s load the data from yesterday’s post; it contains quarterly growth rates of jobs in Australia and real GDP. We’ll only keep the quarterly observations for jobs growth.
library(Quandl); library(dplyr); library(rstan); library(ggplot2); library(reshape2)
options(mc.cores = parallel::detectCores())
# Real GDP, millions of chained dollars - seasonally adjusted
GDP <- Quandl("AUSBS/5206002_EXPENDITURE_VOLUME_MEASURES_A2304402X") %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Date) %>%
  mutate(GDP_growth = Value/lag(Value, 1)) %>%
  dplyr::select(Date, GDP_growth)

# Employed persons - monthly, seasonally adjusted
EMP <- Quandl("AUSBS/6202001_A84423043C") %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Date) %>%
  mutate(Emp_growth = Value/lag(Value, 3)) %>%
  dplyr::select(Date, Emp_growth)

# Join series together
full_data <- left_join(EMP, GDP, by = "Date") %>%
  filter(!is.na(GDP_growth), !is.na(Emp_growth))
Now that we have the series, let’s plot them both with the excellent ggpairs function in the GGally package. Along the diagonal, we have density plots of both variables; on the off-diagonal, we have a bivariate plot showing how the two variables are related.
GGally::ggpairs(full_data[,-1], title = "Pairs plot, GDP and employment growth")
Two things you should notice here:
- The two variables have bell-curve shapes with a slightly fat tail on the left-hand side, corresponding to recessions.
- The two variables are fairly correlated.
What we want to do is come up with a joint distribution of these two variables. Ideally that distribution should be one that the data might plausibly have been drawn from. Although the two marginal densities (down the diagonal) don’t look entirely normal, let’s start by modeling the two series jointly using a multivariate normal density. This distribution takes two parameters: a mean vector $\mu$ and a covariance matrix $\Sigma$.
What do these parameters correspond to? The mean vector $\mu$ contains the means of each series. In this case, something like

| Emp_growth | GDP_growth |
|---|---|
| 1.005 | 1.008 |
How about the covariance matrix $\Sigma$? Well, this parameter tries to do two things: it should capture the variation in each variable individually, and it should capture the correlation between the two.
Rather than thinking of a covariance matrix as serving both purposes at once, it’s easier to think about the two purposes separately, by splitting them into scales and correlations. The scale vector, $\tau$, contains the standard deviations of the two series. It might contain values that look like

| Emp_growth | GDP_growth |
|---|---|
| 0.005082 | 0.00779 |
The correlations between the two series are captured in a correlation matrix $\Omega$. They might look a bit like this:

| | Emp_growth | GDP_growth |
|---|---|---|
| Emp_growth | 1 | 0.3412 |
| GDP_growth | 0.3412 | 1 |
The covariance matrix $\Sigma$ is then simply defined as the quadratic form of the scale vector and correlation matrix:

$$\Sigma = \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)$$

where $\text{diag}(\tau)$ creates a square matrix with the values of $\tau$ down the diagonal.
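To make this concrete, here’s a quick sketch in R using the full_data object from above; the sample statistics stand in for the parameters, and the object names (growth, mu_hat, tau_hat, Omega_hat, Sigma_hat) are just illustrative.

# Sample analogues of the quantities described above
growth    <- as.matrix(full_data[, -1])   # drop the Date column
mu_hat    <- colMeans(growth)             # mean vector
tau_hat   <- apply(growth, 2, sd)         # scale (standard deviation) vector
Omega_hat <- cor(growth)                  # correlation matrix
# Rebuild the covariance matrix as diag(tau) %*% Omega %*% diag(tau)
Sigma_hat <- diag(tau_hat) %*% Omega_hat %*% diag(tau_hat)
# This matches the usual sample covariance matrix (up to floating-point error)
all.equal(Sigma_hat, unname(cov(growth)))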
Estimating the parameters of the density
Now the parameter values printed above (the mean, standard deviations, and correlations) are all statistics of the data. But we really want to draw inference about the parameters of the model that we are saying might have given rise to the data. We need to estimate the parameters of the model.
For this, we need our generative distribution, which we said before was multivariate normal. Where $Y_{t}$ is the vector containing quarterly employment growth and quarterly GDP growth in period $t$,

$$Y_{t} \sim \text{Multi normal}(\mu, \Sigma)$$

where

$$\Sigma = \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)$$
We also need to provide priors. $\tau$ is a vector of standard deviations, which must be positive. We might give it a half-Cauchy prior, which allows for potentially large values for each standard deviation.
What distribution could we give to the correlation matrix $\Omega$? Such a distribution needs to take into account a few facts about correlation matrices:
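To get a feel for what a half-Cauchy scale prior implies, here is a small simulation sketch. It mirrors the cauchy(0, 0.1) prior with a positivity constraint used in the Stan code further down; the object name is just illustrative.

# Half-Cauchy(0, 0.1): most draws are small, but the heavy tail still
# allows the occasional large standard deviation
set.seed(42)
tau_prior_draws <- abs(rcauchy(1e5, location = 0, scale = 0.1))
quantile(tau_prior_draws, probs = c(0.5, 0.9, 0.99))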
- The diagonals are always 1.
- The matrix is symmetric with off-diagonal entries between -1 and 1.
- The matrix should be positive semidefinite. If it were not, we would get negative variances.
- The value of one entry in the matrix has implications for the other entries. If variables A and B are strongly positively correlated, and B and C are strongly positively correlated, then A and C cannot be strongly negatively correlated (a quick numerical check of this appears just below).
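For instance, here is an illustrative check in R: a symmetric matrix with 1s on the diagonal that violates the last point fails the positive-semidefiniteness requirement, so it cannot be a valid correlation matrix. The bad_corr object is just a made-up example.

# A "correlation matrix" where A-B and B-C are strongly positive but A-C is
# strongly negative
bad_corr <- matrix(c( 1.0,  0.9, -0.9,
                      0.9,  1.0,  0.9,
                     -0.9,  0.9,  1.0), nrow = 3)
eigen(bad_corr)$values  # a negative eigenvalue: not positive semidefinite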
One such density that does all of the above is the LKJ density. It takes a single parameter, $\eta$, which must be positive. When $\eta = 1$, this is the same as having a uniform joint prior over the correlations in the system. When $\eta$ grows very large, our correlation matrix collapses towards a diagonal matrix (the identity). For the purposes of the model, we’ll use $\eta = 4$.
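To see what this prior implies in our two-variable case, here is a quick simulation sketch. It relies on the standard result that, for a 2 x 2 correlation matrix, the LKJ($\eta$) prior on the single off-diagonal correlation is the same as $2\,\text{Beta}(\eta, \eta) - 1$; the helper function name is just for illustration.

# Implied prior on the single correlation of a 2x2 LKJ(eta) matrix
lkj_corr_draws <- function(eta, n = 1e4) 2 * rbeta(n, eta, eta) - 1
set.seed(42)
summary(lkj_corr_draws(1))   # eta = 1: flat over (-1, 1)
summary(lkj_corr_draws(4))   # eta = 4 (our choice): gentle shrinkage towards zero
summary(lkj_corr_draws(50))  # large eta: concentrated near zero (a near-diagonal Omega)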
The full model is implementable using the following Stan code:
// saved as multi_normal_model.stan
data {
int T; // number of time periods
int P; // number of variables
matrix[T, P] Y; // observations
}
parameters {
vector[P] mu;
vector<lower = 0>[P] tau;
corr_matrix[P] Omega;
}
model {
// priors
mu ~ normal(1, 1);
tau ~ cauchy(0, .1);
Omega ~ lkj_corr(4);
// likelihood
for(t in 1:T) {
Y[t] ~ multi_normal(mu, quad_form_diag(Omega, tau));
}
}
generated quantities {
vector[T] y_lpd;
// record the log posterior densities (for model comparison)
for (t in 1:T) {
y_lpd[t] = multi_normal_lpdf(Y[t] | mu, quad_form_diag(Omega, tau));
}
}
which we call from R:
multi_normal_model <- stan("multi_normal_model.stan",
                           data = list(T = nrow(full_data),
                                       P = 2,
                                       Y = full_data[,-1]))
And view the parameter estimates, for example by printing just the model parameters (leaving out the pointwise log densities):
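# One way to produce the posterior summary shown below
print(multi_normal_model, pars = c("mu", "Omega", "tau"))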
## Inference for Stan model: multi_normal_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## mu[1] 1.00 0 0.00 1.00 1.00 1.00 1.00 1.01 4000 1
## mu[2] 1.01 0 0.00 1.01 1.01 1.01 1.01 1.01 4000 1
## Omega[1,1] 1.00 0 0.00 1.00 1.00 1.00 1.00 1.00 4000 NaN
## Omega[1,2] 0.32 0 0.07 0.18 0.28 0.33 0.37 0.45 1387 1
## Omega[2,1] 0.32 0 0.07 0.18 0.28 0.33 0.37 0.45 1387 1
## Omega[2,2] 1.00 0 0.00 1.00 1.00 1.00 1.00 1.00 3870 1
## tau[1] 0.01 0 0.00 0.00 0.00 0.01 0.01 0.01 1931 1
## tau[2] 0.01 0 0.00 0.01 0.01 0.01 0.01 0.01 1710 1
##
## Samples were drawn using NUTS(diag_e) at Mon Sep 5 16:02:29 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
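As a quick sanity check on how the pieces fit together, we can rebuild the covariance matrix implied by the posterior means of tau and Omega. This is just a post-processing sketch; the object names are arbitrary.

# Reconstruct the implied covariance matrix at the posterior means
post       <- rstan::extract(multi_normal_model)
tau_post   <- apply(post$tau, 2, mean)          # posterior mean scales
Omega_post <- apply(post$Omega, c(2, 3), mean)  # posterior mean correlation matrix
Sigma_post <- diag(tau_post) %*% Omega_post %*% diag(tau_post)
Sigma_post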
But wait! Wasn’t our data a bit non-normal?
Alert readers will have noticed that the data we’ve just fit a multivariate normal distribution to is a little skewed. In such a case, we might want to model a transformation of the data that seems more normal. In this particular case the transformation doesn’t actually make much of a difference, but the principle stands. One such transformation is the Box-Cox transformation, in which we transform a (nonnegative) series $y$ using a transformation parameter $\lambda$ such that

$$y^{(\lambda)} = \frac{y^{\lambda} - 1}{\lambda}$$

if $\lambda$ is not 0, and $y^{(\lambda)} = \log(y)$ otherwise.
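As a small R sketch of the same transformation (mirroring the branching used in the Stan code below; the function name is just for illustration):

# Box-Cox transform of a positive series y with parameter lambda
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
# lambda = 0 is a log transform; lambda = 1 shifts the series but leaves its shape unchanged
head(box_cox(full_data$GDP_growth, 0.5))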
In this case, we will estimate the value of lambda for each series. The transformation is implemented in Stan below:
// saved as multi_normal_model_boxcox.stan
data {
int T; // number of time periods
int P; // number of variables
matrix[T, P] Y; // observations
}
parameters {
vector[P] mu;
vector[P] lambda;
vector<lower = 0>[P] tau;
corr_matrix[P] Omega;
}
transformed parameters {
matrix[T, P] Y_transformed;
// This is where we make the transformation
for(p in 1:P) {
for(t in 1:T) {
if(lambda[p]==0) {
Y_transformed[t,p] = log(Y[t,p]);
} else {
Y_transformed[t,p] = (Y[t,p]^lambda[p]-1)/lambda[p];
}
}
}
}
model {
// priors
mu ~ normal(1, 2);
lambda ~ normal(0, 15);
tau ~ cauchy(0, 2);
Omega ~ lkj_corr(4);
// likelihood
for(t in 1:T) {
Y_transformed[t] ~ multi_normal(mu, quad_form_diag(Omega, tau));
}
}
generated quantities {
vector[T] y_lpd;
// record the log posterior densities (for model comparison)
for (t in 1:T) {
y_lpd[t] = multi_normal_lpdf(Y_transformed[t] | mu, quad_form_diag(Omega, tau));
}
}
Which we call from R like so:
boxcox_multinormal <- stan("multi_normal_model_boxcox.stan",
                           data = list(T = nrow(full_data),
                                       P = 2,
                                       # The growth series are ratios around 1, so they are
                                       # already positive; otherwise we would add a constant
                                       # to all observations first
                                       Y = full_data[,-1]))
Now that we have estimated both the transformed and untransformed models on the same data, we can check to see which appears to fit the data better. Ideally, we’d want to do cross-validation, which would involve fitting the model many times and checking the log posterior density on each holdout set. For models like this one, that is too time-consuming. Instead, we use an approximation to leave-one-out cross-validation, available in the loo package.
To implement this model comparison, we first need to extract the log likelihood (which across draws is the in-sample log posterior density) from each model. This is precisely the object we created in the generated quantities block in the Stan code: the (log) height of the data density for each data point at the parameters from a given draw. Next, we run the loo() function, which performs the cross-validation approximation. Finally, we can compare the two models with compare(), which reports the difference in expected log posterior density between the two models. If compare returns a positive value, the expected out-of-sample log posterior density of the second model is higher than that of the first, meaning it should perform better out of sample.
library(loo)
# Extract log posterior densities from both models
mod_1_ll <- extract_log_lik(multi_normal_model, "y_lpd")
mod_2_ll <- extract_log_lik(boxcox_multinormal, "y_lpd")
# Estimate leave-one-out cross validation
loo_1 <- loo(mod_1_ll)
loo_2 <- loo(mod_2_ll)
# Now compare
compare(loo_1, loo_2)
## elpd_diff se
## 41.0 19.8
As we can see, the expected log posterior density for the second model (our model with the Box-Cox transformation) is slightly higher than for the simple multivariate normal model.
Edit - what happens in the transformation?
The Box-Cox transformation above rescales the data using the parameters lambda, summarised below.
print(boxcox_multinormal, pars = "lambda")
## Inference for Stan model: multi_normal_model_boxcox.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## lambda[1] -37.96 0.16 8.24 -54.37 -43.46 -37.75 -32.37 -22.07 2747
## lambda[2] -49.02 0.11 6.12 -61.25 -53.15 -49.09 -44.77 -37.13 2901
## Rhat
## lambda[1] 1
## lambda[2] 1
##
## Samples were drawn using NUTS(diag_e) at Mon Sep 5 16:04:09 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
What does this imply for the normality of the rescaled data?
# Take our data, which we'll rescale
positive_data <- full_data[,-1]
# Grab the lambdas and take the posterior median
lambdas_mean <- as.data.frame(boxcox_multinormal) %>%
select(contains("lambda")) %>%
summarise_each(funs(median)) %>%
melt
## No id variables; using all as measure variables
# Transform the data according to the Box-Cox transformation
positive_data[,1] <- (positive_data[,1]^lambdas_mean$value[1] - 1)/-lambdas_mean$value[1]
positive_data[,2] <- (positive_data[,2]^lambdas_mean$value[2] - 1)/-lambdas_mean$value[2]
# Plot the transformed data
GGally::ggpairs(positive_data, title = "Pairs plot, transformed data")
As we can see here, we still have some skewness, though this skewness is towards lower values rather than higher ones. Any gains from this transformation are likely to be small.
Thank you for this tutorial. I found it very helpful! One question: How would you compute predicted values for employment and GDP growth from the multivariate normal model in the generated quantities block?
Just discovered this helpful post, thank you OP. This reply is belated, but here below is how one would compute predicted values within the generated quantities block. The trick is in the to_row_vector(x) function, which converts x so that it will fit into a T x P matrix container:
generated quantities {
  matrix[T, P] y_pred;
  for (t in 1:T) {
    // one simulated (Emp_growth, GDP_growth) draw per time period
    y_pred[t] = to_row_vector(multi_normal_rng(mu, quad_form_diag(Omega, tau)));
  }
}
But really, the T-different rows of predicted values aren't very interesting in this particular example. Each row of GDP and employment prediction is a draw from the same identical distribution. In this model, mu (and also omega and tau) aren't conditional on time or any other predictors. You could get the same TxP predicted values working with the "multi_normal_model" object outside of Stan entirely.