## Tuesday, April 4, 2017

### Hierarchical vector autoregression

A few days ago, Nathaniel Bechhofer wrote to me on Twitter asking if I knew of a good way to use repeated observations (hierarchical) data to estimate a covariance matrix. I’ve done a lot of work with Bayesian Vector Autoregressions (there are about three almost-finished posts on these coming up), but I had never played around with with VARs estimated by partial pooling. This posts illustrates one approach to the technique.
For the uninitiated, a vector autoregression relates current values of a vector of (unconstrained) outcomes in period  to its own lags. We typically assume that the residuals of a VAR process (often called “innovations”) are multivariate normal, with a big part of the literature concerned with estimating the covariance matrix of these innovations under various restrictions.
Explicitly, given a vector of intercepts  and a matrix of coefficients , a VAR(1) process is

with . For the sake of interpretability, we normally decompose  as , where  is a vector of scales (in this case, marginal standard deviations of the innovations in the VAR process), and  is their correlation matrix. A VAR(p) process generalises this to  lags.

#### From VAR to Hierarchcial VAR

Now what if we have outcomes for many units of observation? For instance, we might observe a vector of outcomes for many countries over many years (that is the example below). How do we approach this?
One method would be to “pool” the data together and estimate the model given above. This is the same as assuming that all countries have the same values of  and . A strong assumption? Another approach would be to estimate the model above for each country individually—an “unpooled” estimate. This is the typical approach; it is equivalent to saying that there is nothing we can learn from the behaviour of other countries when estimating the parameters for a given country.
The method I favour is “partial pooling”, which balances these two approached. In partial pooling, parameter estimates are “shrunk” towards a common average, with the degree of shrinkage proportional to the noisiness of the unpooled estimate. Intuitively, this is like saying that the information from the general behaviour across countries can help us get better estimates for an individual country.

#### An example

The example below is a simple hierarchical VAR(1). Each country, indexed by $c$ is observed in some time period $t$. We have country-specific intercepts , slopes , innovation scales  and innovation correlations . Our data are generates as:

with

Each of the country-level parameters are given so-called hierarchical priors. These are priors that are, to an extent, informed by behaviour across countries.
Our hierarchical priors are

and

It is a little difficult to give the correlation matrices  a hierarchical prior, as their values are constrained by other values. One approach—at Ben Goodrich’s suggestion—is to use a weighting parameter  and shrink local correlation estimates towards a global correlation matrix using

with

and

A big advantage of this approach is that we end up with an estimate for  as well as the correlation matrices. This tells us how important the partial pooling is to the estimates of the . The priors indicate that we expect relatively more correlation at the global than the local level.

#### A worked example

In this example, we pull data from the World Bank’s World Development Indicators (which has a handy R API). In particular, we’ll pull annual values of GDP, consumption, and gross fixed capital formation (investment) for all countries from 1970. Figures are annual, and in constant-price local currency units. We’ll convert the data into annual (continuous compounding) growth rates, and model these using the hierarchical VAR described above.
# We'll use WDI data from the World Bank, dplyr and rstan
library(WDI); library(dplyr)
library(rstan)
options(mc.cores = parallel::detectCores())

# Grab gdp, consumption and investment (lcu constant prices)
gdp_cons_inv <- WDI(indicator = c("NY.GDP.MKTP.KN","NE.CON.TOTL.KN", "NE.GDI.FTOT.KN"),
start = 1970) 
Next we convert the data into differenced logs, and restrict the observations to those for which we have no missing observations.
For the sake of estimating the model quickly, I restrict the data to the English-speaking countries plus Chile. For all 138 countries (on a four-core laptop), the model took about an hour to estimate.
# Take complete cases, then take diffed logs. Make sure we don't keep countries with < 10 years of data
gdp_cons_inv_1 <- gdp_cons_inv%>%
filter(complete.cases(.)) %>%
rename(GDP = NY.GDP.MKTP.KN,
CONS = NE.CON.TOTL.KN,
GFCF = NE.GDI.FTOT.KN) %>%
group_by(country) %>%
arrange(year) %>%
mutate(dl_gdp = c(NA, diff(log(GDP))),
dl_cons = c(NA, diff(log(CONS))),
dl_gfcf = c(NA, diff(log(GFCF))),
more_than_10= sum(!is.na(dl_gfcf))>10) %>%
arrange(country, year) %>%
ungroup %>%
filter(more_than_10 & is.finite(dl_gfcf))

# Complete cases then add a time index (starts at one for every country)
gdp_cons_inv_1 <- gdp_cons_inv_1 %>%
ungroup %>%
filter(complete.cases(.)) %>%
group_by(country) %>%
mutate(time= 1:n())

# Takes a while to run, so I'll just do it with English speaking countries plus Chile (liberal commodity economy)
gdp_cons_inv_2 <- gdp_cons_inv_1 %>%
ungroup %>%
filter(country %in% c("United States", "United Kingdom", "Australia",
"New Zealand", "Chile", "Canada", "Ireland", "South Africa"))
Next we write the Stan model. This is saved as hierarchical_var.stan in the directory that we’re working from. This simply implements the model above, using non-centered parameterization for  and .
// saved as hierarchical_var.stan
data {
int N; // number of observations (number of rows in the panel)
int K; // number of dimensions of the outcome variable
int I; // number of individuals
int T; // The greatest number of time periods for any individual
int<lower = 1, upper = I> individual[N]; // integer vector the same length
// as the panel, indicating which individual each row corresponds to
int<lower = 1, upper = T> time[N]; // integer vector with individual-time
// (not absolute time). Each individual starts at 1
matrix[N, K] Y; // the outcome matrix, each individual stacked on each
// other, observations ordered earliest to latest
}
parameters {
// individual-level parameters
corr_matrix[K] Omega_local[I]; // cholesky factor of correlation matrix of
// residuals each individual (across K outcomes)
vector<lower = 0>[K] tau[I]; // scale vector for residuals
matrix[K, K] z_beta[I];
vector[K] z_alpha[I];

// hierarchical priors (individual parameters distributed around these)
real<lower = 0, upper = 1> rho;
corr_matrix[K] Omega_global;
vector[K] tau_location;
vector<lower =0>[K] tau_scale;
matrix[K,K] beta_hat_location;
matrix<lower = 0>[K,K] beta_hat_scale;
vector[K] alpha_hat_location;
vector<lower = 0>[K] alpha_hat_scale;
}
transformed parameters {
matrix[K, K] beta[I]; // VAR(1) coefficient matrix
vector[K] alpha[I]; // intercept matrix
corr_matrix[K] Omega[I];

for(i in 1:I) {
alpha[i] = alpha_hat_location + alpha_hat_scale .*z_alpha[i];
beta[i] = beta_hat_location + beta_hat_scale .*z_beta[i];
Omega[i] = rho*Omega_global + (1-rho)*Omega_local[i];
}

}
model {
// hyperpriors
rho ~ beta(2, 2);
tau_location ~ cauchy(0, 1);
tau_scale ~ cauchy(0, 1);
alpha_hat_location ~ normal(0, 1);
alpha_hat_scale ~ cauchy(0, 1);
to_vector(beta_hat_location) ~ normal(0, .5);
to_vector(beta_hat_scale) ~ cauchy(0, .5);
Omega_global ~ lkj_corr(1);

// hierarchical priors
for(i in 1:I) {
// non-centered parameterization
z_alpha[i] ~ normal(0, 1);
to_vector(z_beta[i]) ~ normal(0, 1);
tau[i] ~ normal(tau_location, tau_scale);
Omega_local[i] ~ lkj_corr(10);
}

// likelihood
for(n in 1:N) {
if(time[n]>1) {
Y[n] ~ multi_normal(alpha[individual[n]] + beta[individual[n]]*Y[n-1]',
}
}
}

We get the data for Stan and compile the model like so:
# prepare data for Stan
mod_data <- list(
individual = as.numeric(as.factor(gdp_cons_inv_2$country)), time = gdp_cons_inv_2$time,
T = max(gdp_cons_inv_2$time), I = max(as.numeric(as.factor(gdp_cons_inv_2$country))),
N = nrow(gdp_cons_inv_2),
Y = gdp_cons_inv_2 %>%
ungroup %>%
select(dl_gdp, dl_cons, dl_gfcf),
K = 3
)

# Compile model
p_var <- stan_model("hierarchical_var.stan")
And finally we estimate the model.
estimated_model <- sampling(p_var,
data = mod_data,
iter = 1000)
Now Nathaniel’s question was about the covariance in the innovations to the outcomes. We can extract the expected value of this global correlation matrix as so:
extract(estimated_model, pars = "Omega_global", permuted = T)[[1]] %>%
apply(c(2, 3), mean)
##
##             [,1]      [,2]      [,3]
##   [1,] 1.0000000 0.8899941 0.9254894
##   [2,] 0.8899941 1.0000000 0.7381890
##   [3,] 0.9254894 0.7381890 1.0000000
This would be the expected value for the correlations between macro variates for a new country. If we wanted to extract the expected correlation matrix for a specific country, say, Australia, we could do
# Australia is the first index
extract(estimated_model, pars = "Omega", permuted = T)[[1]][,1,,] %>%
apply(c(2, 3), mean)
##          [,1]      [,2]      [,3]
## [1,] 1.000000 0.6904760 0.7814910
## [2,] 0.690476 1.0000000 0.5872822
## [3,] 0.781491 0.5872822 1.0000000
Finally we can check to see the relevance of the global correlation matrix when estimating the country-specific correlation matrices by examining the estimate of $\rho$.
print(estimated_model, pars = "rho")
## Inference for Stan model: hierarchical_var.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
##     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## rho 0.82       0 0.04 0.75 0.79 0.82 0.84   0.9   610 1.01
##
## Samples were drawn using NUTS(diag_e) at Sun Nov 27 16:07:19 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
A value of 0.82 indicates that there is a lot of “borrowed power” from the hierarchical nature of the estimate. (If it were 1, all countries’ correlation matrices would be the global correlation matrix; if it were 0, they’d have nothing in common.)

#### So when should we use a hierarchical VAR?

The real value of using a hierarchical approach is when we want to estimate a richly parameterized model with relatively few data points. In the example above, we’re using 313 observations of 3 variables to estimate 171 parameters. This is only doable if we have good priors. But we don’t want to be too subjective about these priors. Using the hierarchical prior approach here allows us to “learn the priors” by sharing information across units of observation.
As always, if you have any comments, shoot me an email or get in touch on Twitter (@jim_savage_).