Monday, April 10, 2017

Why learn Bayesian Modeling?

Jim Savage

[I'm drip-feeding you bits of my upcoming book on Bayesian modeling. Please feel free to leave critical comments, as they'll make the book better :-)]

Learning Bayesian modeling does require a time investment. If you build and estimate statistical models for a living, it is probably an investment worth making, but we should be very explicit about the benefits (and costs) up-front. The benefits are many. Using Bayesian methods, we can take advantage of information that does not necessarily exist in our data (or model structure) when estimating our model. We can combine sources of information in a simple, coherent fashion. Uncertainty in our predictions automatically incorporates uncertainty in parameter estimates. We can define models that are arbitrarily rich—potentially with more parameters than we have data points—and still expect coherent parameter estimates. We don’t use tests; instead we just check the implied probabilities of outcomes of interest in our estimated model, whose parameters we can give probabilistic interpretations. And perhaps most importantly, using Bayesian methods forces us to understand our models far more deeply than canned routines allow.
These benefits don’t come free. The approach we advocate in this book—estimating models using full Markov Chain Monte Carlo—can appear slow. Learning Bayesian techniques, and a new programming language, is costly. And some fields have strong frequentist cultures, making communication of Bayesian results an important part of your work. We feel these costs are small relative to the potential gains.
Let’s illustrate these benefits with examples. These examples are from real-world applied work by ourselves and our colleagues; hopefully you will see analogies with your own work.

Benefit 1: Incorporating knowledge from outside the data

When we run a field experiment, we typically want to evaluate the impact of some experimental treatment on an outcome. This impact is known as a treatment effect. Experiments can be costly to perform, limiting the number of observations that we can collect. Consequently, in these small-data studies it is common to have very imprecise estimates of the treatment effect.
Bayesian analysis of the experiment can help when there is more information available about the treatment effect than exists in the observations from our experiment, as would be the case if there had been previous studies of the same treatment effect. These previous studies’ results can be incorporated into our study using what is known as a hierarchical prior, resulting in more precise estimates of the treatment effect.
For example, imagine you are an education researcher evaluating the impact of a trendy new teaching method on test scores. You run a randomized experiment on seventy students at one school and learn that the intervention improved test scores by 34 points on a scale from 200 to 800. Because of the fairly small sample size, this estimate has a standard error of 23.5, and is not “statistically significant”, with a p-value of 0.15 (greater than the arbitrary 0.05 threshold used for declaring statistical significance in many fields). This is consistent with a 95% confidence interval for the estimate of (-12.8 to 80.9). If you were to roll out the intervention to a large cohort of students, what would you expect the treatment effect to be? Zero, because our estimate is not statistically significant? 34? Some other number?
Now suppose that because this educational intervention is trendy, it is being studied by other researchers. You find that these researchers have obtained the following estimates of the effect of the same treatment (this is Rubin’s famous 8 Schools data (Rubin 1981)):
School    Treatment effect    Standard error
A         28.39               14.9
B         7.94                10.2
C         -2.75               16.3
D         6.82                11.0
E         -0.64               9.4
F         0.63                11.4
G         18.01               10.4
H         12.16               17.6
After seeing the results of these other studies of the same intervention, how might your expectations of a roll-out of the intervention change? It turns out that we can use these results when re-analyzing our own data in order to get a more precise estimate of the treatment effect in our own study.
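To foreshadow how such a re-analysis might look, here is a minimal sketch in rstan that partially pools our estimate with the eight above using the standard hierarchical ("8 schools") model. It is an illustration only, with flat hyperpriors and a centered parameterization that may need tweaking (for example, non-centering) to sample cleanly; it is not necessarily the exact model we develop later.
# Partially pool our estimate (34, se 23.5) with the eight estimates above
library(rstan)

model_code <- "
data {
  int J;                      // number of studies
  vector[J] y;                // estimated treatment effects
  vector<lower = 0>[J] sigma; // standard errors of those estimates
}
parameters {
  real mu;                    // population-average treatment effect
  real<lower = 0> tau;        // between-study standard deviation
  vector[J] theta;            // study-level treatment effects
}
model {
  theta ~ normal(mu, tau);    // hierarchical prior
  y ~ normal(theta, sigma);   // measurement model
}
"

study_data <- list(
  J = 9,
  y = c(34, 28.39, 7.94, -2.75, 6.82, -0.64, 0.63, 18.01, 12.16),
  sigma = c(23.5, 14.9, 10.2, 16.3, 11.0, 9.4, 11.4, 10.4, 17.6)
)

fit <- stan(model_code = model_code, data = study_data)
print(fit, pars = c("mu", "tau", "theta[1]"))  # theta[1] is our own study's effect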

Benefit 2: Combining sources of information

A similar type of analysis is to use Bayesian methods to combine sources of information, as in a meta-analysis or political poll aggregation. The aim of this type of research is to estimate a statistic—for instance, the distribution of political preferences, or a treatment effect—that you would expect to find in as-yet untested populations, using previous studies (rather than their underlying data) as the data source. The central notion is that these previous studies are noisy estimates of some underlying statistic that applies to the whole population. They are noisy partly because of sampling variability, but also because the studies might differ systematically from one another (political pollsters using different questions, for example). An important assumption is that, taken together, these studies are not systematically biased once we account for observable differences between them.
A recent example of this type of analysis is Meager (2016), who uses hierarchical Bayesian modeling to obtain estimates of the generalizable treatment effects (and quantile effects) of microcredit expansions on various household measures of financial success.
The study makes several contributions, but two highlight the power of a (hierarchical) Bayesian approach. The first is that a by-product of the Bayesian aggregation procedure is an estimate of the generalizability of a given previous study. That is, the procedure tells us how much we can expect to learn about the impact of a yet-untried microcredit expansion from a given experiment. The second is that using a hierarchical Bayesian aggregation procedure gives us new estimates of the treatment effects in the previous studies. Remember: the estimates from those previous studies are noisy. The technique reduces the noise in these estimates by “borrowing power” from other studies. In the report, several of the “statistically significant” findings in the previous studies lose their “significance” once we adjust them for the fact that similar studies find much smaller (or zero) effects. In a world in which costly policy decisions might be influenced by false discoveries, this feature is appealing.

Benefit 3: Dealing with uncertainty consistently in model predictions

We often use models to generate predictions or forecasts. There are several types of uncertainty that we ought to be concerned with. The first is sampling variability: even if we have the “perfect model” (we don’t believe such a thing exists) there will remain variation in what we are predicting, either because of pure randomness or measurement error. If we were to use the model for predictions, we should expect the model to be right on average, but not right in every instance.
The second source of uncertainty is uncertainty in the unknown parameters in our model. A model itself typically combines “knowns” and unknown variables, and the goal of estimating a model is to draw inference about the value of the unknowns. So long as we only have a limited number of observations, we will be unsure of the precise values of the unknowns; this contributes to uncertainty in our predictions.
The third source of uncertainty is uncertainty about whether the model is the right model for the job— is it correctly specified, and are we modeling a stationary (or non-changing) set of relationships? If the model is improperly specified or if the fundamental relationships of the system are changing, our model will not perform well. Importantly, this type of uncertainty is not represented by predictive intervals—it is therefore prudent to treat predictive intervals as the minimum amount of uncertainty that we should have over the outcome.
By using Bayesian methods, we automatically deal with the first and second sources of uncertainty, without resorting to workarounds. Non-Bayesians can do this as well (for example, by bootstrapping), but it is not an inherent part of the procedure. But neither Bayesian nor non-Bayesian techniques deal well with the problem of poorly specified models and “model non-stationarity”. So what do we do? Using a well-thought-through workflow and “generative reasoning” can help us iterate towards a specification that describes the data at hand well. Unfortunately, model non-stationarity is a difficult, philosophical problem. There are no quick fixes.
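As a toy sketch of how the first two sources of uncertainty combine, suppose we have fitted a simple linear model (hypothetical, not from any example above) and hold a matrix of posterior draws with columns alpha, beta and sigma. Predictive draws at a new input then add sampling noise on top of parameter uncertainty:
# Toy sketch for a hypothetical model y ~ normal(alpha + beta * x, sigma):
# each posterior draw of the parameters generates one predictive draw, so the
# predictive distribution reflects parameter uncertainty (the rows of `draws`)
# as well as sampling variability (the rnorm call).
predictive_draws <- function(draws, x_new) {
  mu <- draws[, "alpha"] + draws[, "beta"] * x_new
  rnorm(nrow(draws), mean = mu, sd = draws[, "sigma"])
}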

Benefit 4: Regularizing richly parameterized models

Many important predictive/control variables have high dimensionality. For example, imagine you are building a model to predict whether a package sent from town A to town B will be delivered late. An important predictive variable will be the origin and destination towns. In most countries there are a very large number of post-codes, whose values have no intrinsic meaning (so we cannot use them as a numerical control variable). How can we use these values to generate better predictions? And how should we learn from towns with very few observations vis-à-vis towns with many?
How do we deal with this problem without using Bayesian methods? One popular technique is to “one-hot” encode the categorical predictors, so that our data go from
zip_code    other_data
11111       …
11112       …
11113       …
11114       …
to
zip_code 11111    zip_code 11112    zip_code 11113    zip_code 11114
1                 0                 0                 0
0                 1                 0                 0
0                 0                 1                 0
0                 0                 0                 1
And then continue apace. The problem with such an approach is that our parameter space becomes extremely large, and we might end up doing “noise-mining”—discovering relationships where there are none, simply through bad luck. A trick known as “regularization” can help prevent this, and is considered good practice when you have a model with many variables. It does so by “penalizing” parameter estimates that are not estimated precisely—for instance, those zip codes with very few observations—“shrinking” the parameter estimates towards zero (or some other value). Regularization like this is widely used in machine learning. A difficulty is that most canned routines that implement regularization do not allow for easy incorporation of uncertainty in the parameter estimates.
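As a sketch of the non-Bayesian workflow just described, using a hypothetical data frame parcels with a binary late outcome and a zip_code factor (made-up names, not real data):
# One-hot encode the zip codes and fit a ridge-penalized logistic regression.
# cv.glmnet chooses the penalty by cross-validation; alpha = 0 gives the ridge
# penalty, which shrinks weakly identified zip-code coefficients towards zero.
library(glmnet)
X <- model.matrix(~ zip_code - 1, data = parcels)  # one indicator column per zip code
fit_ridge <- cv.glmnet(X, parcels$late, family = "binomial", alpha = 0)
coef(fit_ridge, s = "lambda.min")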
Bayesian methods also employ regularization, through the use of priors. This is because our parameter estimates will always sit between the likelihood estimate (typically, the estimate you’d get from a non-Bayesian approach) and the values favoured by our priors. Strictly, priors should encode the information that we have about parameter values before we estimate the model. One valuable piece of prior information that we often have is “regularization works!” or “mean reversion happens!” These partly motivate the approaches we advocate for modeling hierarchical and panel data, especially the use of hierarchical priors.
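The Bayesian analogue can be sketched with rstanarm on the same hypothetical parcels data: the zip-code intercepts receive a hierarchical prior whose scale, and hence the amount of shrinkage, is estimated from the data rather than tuned by cross-validation.
# Varying intercepts by zip code with a hierarchical prior (sketch only)
library(rstanarm)
fit_hier <- stan_glmer(late ~ (1 | zip_code), family = binomial(), data = parcels)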

Benefit 5: Doing away with tests

In frequentist statistics, inference is performed through the use of tests. Testing is normally the process of combining model and data to come up with a test statistic, which will have some large-sample limiting distribution under the assumption that the null hypothesis is correct. We then compare the test statistic to that limiting distribution to determine whether there exists sufficient evidence to reject the null hypothesis.
Done well, testing can yield useful insights and help guide modeling. But as an approach to workflow and science, testing is difficult to learn and remember, easy to abuse, and with limited data, can result in many erroneous conclusions.
In Bayesian analysis, we do not use tests. All inference is conducted by analyzing our fitted model (the posterior), which is typically a matrix of draws from the joint distribution of all parameters and predictions. The posterior has a probabilistic interpretation, and consequently if we want to make a probabilistic statement about our fitted model, we only need to count the number of draws from the posterior for which a condition holds, and divide it by the number of draws.
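For example, given a vector theta_draws of posterior draws of a treatment effect (a hypothetical name), the probability that the effect is positive is just the proportion of draws above zero:
# Posterior probability that the treatment effect exceeds zero
mean(theta_draws > 0)
# ...or that it exceeds some policy-relevant threshold
mean(theta_draws > 10)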
This is a far more intuitive and easy-to-use way of conducting inference.

Rubin, Donald B. 1981. “Estimation in Parallel Randomized Experiments.” Journal of Educational Statistics 6: 377–401.
Meager, Rachael. 2016. “Aggregating Distributional Treatment Effects: A Bayesian Hierarchical Analysis of the Microcredit Literature.”

Tuesday, April 4, 2017

Hierarchical vector autoregression

A few days ago, Nathaniel Bechhofer wrote to me on Twitter asking if I knew of a good way to use repeated observations (hierarchical) data to estimate a covariance matrix. I’ve done a lot of work with Bayesian Vector Autoregressions (there are about three almost-finished posts on these coming up), but I had never played around with VARs estimated by partial pooling. This post illustrates one approach to the technique.
For the uninitiated, a vector autoregression relates current values of a vector of (unconstrained) outcomes in period $t$ to its own lags. We typically assume that the residuals of a VAR process (often called “innovations”) are multivariate normal, with a big part of the literature concerned with estimating the covariance matrix of these innovations under various restrictions.
Explicitly, given a vector of intercepts $\alpha$ and a matrix of coefficients $\beta$, a VAR(1) process is

$$ Y_t = \alpha + \beta\, Y_{t-1} + \epsilon_t $$

with $\epsilon_t \sim \text{Multi Normal}(0, \Sigma)$. For the sake of interpretability, we normally decompose $\Sigma$ as $\text{diag}(\tau)\, \Omega\, \text{diag}(\tau)$, where $\tau$ is a vector of scales (in this case, marginal standard deviations of the innovations in the VAR process), and $\Omega$ is their correlation matrix. A VAR(p) process generalises this to $p$ lags.
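To make the notation concrete, here is a quick simulation of a two-dimensional VAR(1) in R, with made-up parameter values (nothing below comes from the data used later in this post):
# Simulate a K = 2 VAR(1) with made-up alpha, beta, tau and Omega
set.seed(49)
K <- 2; T_periods <- 200
alpha <- c(0.1, -0.05)                       # intercepts
beta  <- matrix(c(0.5, 0.1, 0.2, 0.3), K, K) # VAR(1) coefficients (stationary)
tau   <- c(1, 0.5)                           # scales of the innovations
Omega <- matrix(c(1, 0.4, 0.4, 1), K, K)     # correlations of the innovations
Sigma <- diag(tau) %*% Omega %*% diag(tau)   # implied covariance matrix
L     <- t(chol(Sigma))                      # for drawing correlated innovations

Y <- matrix(0, T_periods, K)
for (t in 2:T_periods) {
  Y[t, ] <- alpha + beta %*% Y[t - 1, ] + L %*% rnorm(K)
}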

From VAR to Hierarchical VAR

Now what if we have outcomes for many units of observation? For instance, we might observe a vector of outcomes for many countries over many years (that is the example below). How do we approach this?
One method would be to “pool” the data together and estimate the model given above. This is the same as assuming that all countries have the same values of $\alpha$ and $\beta$. A strong assumption? Another approach would be to estimate the model above for each country individually—an “unpooled” estimate. This is the typical approach; it is equivalent to saying that there is nothing we can learn from the behaviour of other countries when estimating the parameters for a given country.
The method I favour is “partial pooling”, which balances these two approaches. In partial pooling, parameter estimates are “shrunk” towards a common average, with the degree of shrinkage proportional to the noisiness of the unpooled estimate. Intuitively, this is like saying that the information from the general behaviour across countries can help us get better estimates for an individual country.

An example

The example below is a simple hierarchical VAR(1). Each country, indexed by $c$, is observed in some time period $t$. We have country-specific intercepts $\alpha_c$, slopes $\beta_c$, innovation scales $\tau_c$ and innovation correlations $\Omega_c$. Our data are generated as

$$ Y_{c,t} = \alpha_c + \beta_c\, Y_{c,t-1} + \epsilon_{c,t} $$

with

$$ \epsilon_{c,t} \sim \text{Multi Normal}\left(0,\ \text{diag}(\tau_c)\, \Omega_c\, \text{diag}(\tau_c)\right). $$
Each of the country-level parameters is given a so-called hierarchical prior. These are priors that are, to an extent, informed by behaviour across countries.
Our hierarchical priors are

$$ \alpha_c \sim \text{Normal}(\hat\alpha_{\text{location}},\ \hat\alpha_{\text{scale}}) $$

$$ \beta_c \sim \text{Normal}(\hat\beta_{\text{location}},\ \hat\beta_{\text{scale}}) \quad \text{(elementwise)} $$

and

$$ \tau_c \sim \text{Normal}_{+}(\tau_{\text{location}},\ \tau_{\text{scale}}), $$

with the location and scale hyperparameters themselves given weakly informative priors (these appear in the Stan code below).
It is a little difficult to give the correlation matrices $\Omega_c$ a hierarchical prior, as their entries are constrained by one another (each $\Omega_c$ must be a valid correlation matrix). One approach—at Ben Goodrich’s suggestion—is to use a weighting parameter $\rho \in (0, 1)$ and shrink local correlation estimates towards a global correlation matrix using

$$ \Omega_c = \rho\, \Omega_{\text{global}} + (1 - \rho)\, \Omega_{\text{local},c} $$

with

$$ \rho \sim \text{Beta}(2, 2), $$

$$ \Omega_{\text{global}} \sim \text{LKJ}(1) $$

and

$$ \Omega_{\text{local},c} \sim \text{LKJ}(10). $$

A big advantage of this approach is that we end up with an estimate of $\rho$ as well as of the correlation matrices. This tells us how important the partial pooling is to the estimates of the $\Omega_c$. The priors indicate that we expect relatively more correlation at the global level than at the local level.

A worked example

In this example, we pull data from the World Bank’s World Development Indicators (which has a handy R API). In particular, we’ll pull annual values of GDP, consumption, and gross fixed capital formation (investment) for all countries from 1970, in constant-price local currency units. We’ll convert the data into annual (continuous-compounding) growth rates, and model these using the hierarchical VAR described above.
First we load the packages and download the data
# We'll use WDI data from the World Bank, dplyr and rstan
library(WDI); library(dplyr)
library(rstan)
options(mc.cores = parallel::detectCores())

# Grab gdp, consumption and investment (lcu constant prices)
gdp_cons_inv <- WDI(indicator = c("NY.GDP.MKTP.KN","NE.CON.TOTL.KN", "NE.GDI.FTOT.KN"), 
                    start = 1970) 
Next we convert the data into differenced logs, and restrict the sample to observations with no missing values.
For the sake of estimating the model quickly, I restrict the data to the English-speaking countries plus Chile. For all 138 countries, the model took about an hour to estimate on a four-core laptop.
# Take complete cases, then take diffed logs. Make sure we don't keep countries with < 10 years of data
gdp_cons_inv_1 <- gdp_cons_inv%>% 
  filter(complete.cases(.)) %>% 
  rename(GDP = NY.GDP.MKTP.KN,
         CONS = NE.CON.TOTL.KN,
         GFCF = NE.GDI.FTOT.KN) %>% 
  group_by(country) %>% 
  arrange(year) %>%
  mutate(dl_gdp = c(NA, diff(log(GDP))), 
         dl_cons = c(NA, diff(log(CONS))),
         dl_gfcf = c(NA, diff(log(GFCF))),
         more_than_10= sum(!is.na(dl_gfcf))>10) %>%
  arrange(country, year) %>%
  ungroup %>%
  filter(more_than_10 & is.finite(dl_gfcf))

# Complete cases then add a time index (starts at one for every country)
gdp_cons_inv_1 <- gdp_cons_inv_1 %>%
  ungroup %>%
  filter(complete.cases(.)) %>% 
  group_by(country) %>% 
  mutate(time= 1:n())

# Takes a while to run, so I'll just do it with English speaking countries plus Chile (liberal commodity economy)
gdp_cons_inv_2 <- gdp_cons_inv_1 %>% 
  ungroup %>% 
  filter(country %in% c("United States", "United Kingdom", "Australia", 
                        "New Zealand", "Chile", "Canada", "Ireland", "South Africa"))
Next we write the Stan model. This is saved as hierarchical_var.stan in the directory that we’re working from. This simply implements the model above, using a non-centered parameterization for $\alpha_c$ and $\beta_c$.
// saved as hierarchical_var.stan
data {
  int N; // number of observations (number of rows in the panel)
  int K; // number of dimensions of the outcome variable
  int I; // number of individuals
  int T; // The greatest number of time periods for any individual
  int<lower = 1, upper = I> individual[N]; // integer vector the same length 
  // as the panel, indicating which individual each row corresponds to
  int<lower = 1, upper = T> time[N]; // integer vector with individual-time 
  // (not absolute time). Each individual starts at 1
  matrix[N, K] Y; // the outcome matrix, each individual stacked on each 
  // other, observations ordered earliest to latest
}
parameters {
  // individual-level parameters
  corr_matrix[K] Omega_local[I]; // correlation matrix of residuals for 
  // each individual (across K outcomes)
  vector<lower = 0>[K] tau[I]; // scale vector for residuals
  matrix[K, K] z_beta[I];
  vector[K] z_alpha[I];
  
  // hierarchical priors (individual parameters distributed around these)
  real<lower = 0, upper = 1> rho;
  corr_matrix[K] Omega_global;
  vector[K] tau_location;
  vector<lower =0>[K] tau_scale;
  matrix[K,K] beta_hat_location;
  matrix<lower = 0>[K,K] beta_hat_scale;
  vector[K] alpha_hat_location;
  vector<lower = 0>[K] alpha_hat_scale;
}
transformed parameters {
  matrix[K, K] beta[I]; // VAR(1) coefficient matrix
  vector[K] alpha[I]; // intercept vector for each individual
  corr_matrix[K] Omega[I];
  
  for(i in 1:I) {
    alpha[i] = alpha_hat_location + alpha_hat_scale .*z_alpha[i];
    beta[i] = beta_hat_location + beta_hat_scale .*z_beta[i];
    Omega[i] = rho*Omega_global + (1-rho)*Omega_local[i];
  }
    
}
model {
  // hyperpriors
  rho ~ beta(2, 2);
  tau_location ~ cauchy(0, 1);
  tau_scale ~ cauchy(0, 1);
  alpha_hat_location ~ normal(0, 1);
  alpha_hat_scale ~ cauchy(0, 1);
  to_vector(beta_hat_location) ~ normal(0, .5);
  to_vector(beta_hat_scale) ~ cauchy(0, .5);
  Omega_global ~ lkj_corr(1);

  
  // hierarchical priors
  for(i in 1:I) {
    // non-centered parameterization
    z_alpha[i] ~ normal(0, 1);
    to_vector(z_beta[i]) ~ normal(0, 1);
    tau[i] ~ normal(tau_location, tau_scale);
    Omega_local[i] ~ lkj_corr(10);
  }
  
  // likelihood
  for(n in 1:N) {
    if(time[n]>1) {
      Y[n] ~ multi_normal(alpha[individual[n]] + beta[individual[n]]*Y[n-1]', 
                          quad_form_diag(Omega[individual[n]], tau[individual[n]]));
    }
  }
}
We get the data for Stan and compile the model like so:
# prepare data for Stan
mod_data <- list(
  individual = as.numeric(as.factor(gdp_cons_inv_2$country)),
  time = gdp_cons_inv_2$time,
  T = max(gdp_cons_inv_2$time),
  I = max(as.numeric(as.factor(gdp_cons_inv_2$country))),
  N = nrow(gdp_cons_inv_2),
  Y = gdp_cons_inv_2 %>% 
    ungroup %>%
    select(dl_gdp, dl_cons, dl_gfcf),
  K = 3
)

# Compile model
p_var <- stan_model("hierarchical_var.stan")
And finally we estimate the model.
estimated_model <- sampling(p_var, 
                            data = mod_data, 
                            iter = 1000)
Now Nathaniel’s question was about the covariance of the innovations to the outcomes. We can extract the expected value of this global correlation matrix like so:
extract(estimated_model, pars = "Omega_global", permuted = T)[[1]] %>% 
  apply(c(2, 3), mean)
##       
##             [,1]      [,2]      [,3]
##   [1,] 1.0000000 0.8899941 0.9254894
##   [2,] 0.8899941 1.0000000 0.7381890
##   [3,] 0.9254894 0.7381890 1.0000000
This would be the expected value for the correlations between macro variates for a new country. If we wanted to extract the expected correlation matrix for a specific country, say, Australia, we could do
# Australia is the first index
extract(estimated_model, pars = "Omega", permuted = T)[[1]][,1,,] %>% 
  apply(c(2, 3), mean)
##          [,1]      [,2]      [,3]
## [1,] 1.000000 0.6904760 0.7814910
## [2,] 0.690476 1.0000000 0.5872822
## [3,] 0.781491 0.5872822 1.0000000
Finally we can check to see the relevance of the global correlation matrix when estimating the country-specific correlation matrices by examining the estimate of ρ.
print(estimated_model, pars = "rho")
## Inference for Stan model: hierarchical_var.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## rho 0.82       0 0.04 0.75 0.79 0.82 0.84   0.9   610 1.01
## 
## Samples were drawn using NUTS(diag_e) at Sun Nov 27 16:07:19 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
A value of 0.82 indicates that there is a lot of “borrowed power” from the hierarchical nature of the estimate. (If it were 1, all countries’ correlation matrices would be the global correlation matrix; if it were 0, they’d have nothing in common.)

So when should we use a hierarchical VAR?

The real value of using a hierarchical approach is when we want to estimate a richly parameterized model with relatively few data points. In the example above, we’re using 313 observations of 3 variables to estimate 171 parameters. This is only doable if we have good priors. But we don’t want to be too subjective about these priors. Using the hierarchical prior approach here allows us to “learn the priors” by sharing information across units of observation.
As always, if you have any comments, shoot me an email or get in touch on Twitter (@jim_savage_).