Modern Statistical Workflow: August 2016

David Stephan and Jim Savage

31 August 2016

Policymakers are forced to rely on many different partial indicators (such as retail trade figures or vacancy data) to try and form a picture of the current state of the economy. The difficulty faced when analysing partial data is what importance to place on a certain series, especially large movements. Are they a sign of a coming downturn, or are they merely inherent noise in the measurement of the series?

The process of combining all the series into a picture of the business cycle is quite difficult and often messy. Given that the business cycle is not entirely captured by any one variable (whether it be GDP, employment or share prices), but rather reflects movements in many variables, it would be beneficial to have an indicator that somehow provided a consistent and coherent method for combining the useful information provided by the partial indicators.

One technique for achieving this goal is to what are known as factor models. We can think of macroeconomic time series as incorporating a common component and an idiosyncratic component. Think of any macroeconomic indicator, such as employment growth:

where $f_{t}$ is the common factor with factor loading (or weight) $γ$ , and $ϵ_{t}$ is a an idiosyncratic error term, about which we need to give a distributional assumption. This same specification will hold for all the series we are interested in (and/or have data for):

There are a number of ways in which we could extract the common factor. In many studies, principal components is used. In our setting, especially if we wish to use a mixture of data frequencies (monthly, quarterly, etc.) it may be a simpler to instead use a state-space representation of our model. We retain the above equation as the signal or measurement equation of our system and allow a simple assumption for the evolution of the underlying state variable (the common factor):

where the common factor evolves as a simple autoregressive model with coefficient $θ$ , and $ω$ is a an idiosyncratic error term, which we assume is distributed normally with standard deviation $σ$ . It is easy to extend the structure of the model to have more than one common factor and for the factor itself to evolve as a higher order or a random walk.

Factor models such as these have been shown to provide competitive forecasts of major macroeconomic aggregates when compared with standard time-series models. There are several versions of such models in use at the NY FED and the ECB.

Putting together some Australian data

In the example we give below, we use standardized annual growth in Australian employment and GDP as our signals of economic activity. These can be easily pulled from Quandl from within R:

library(Quandl); library(dplyr); library(rstan); library(ggplot2); library(reshape2)
options(mc.cores = parallel::detectCores())

#real gdp millions chained dollars - seasonally adj
GDP <- Quandl("AUSBS/5206002_EXPENDITURE_VOLUME_MEASURES_A2304402X") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(GDP_growth = log(Value/lag(Value, 4))) %>% 
  dplyr::select(Date, GDP_growth)

#employed persons - monthly seasonally adj
EMP <- Quandl("AUSBS/6202001_A84423043C") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(Emp_growth = log(Value/lag(Value, 12))) %>%
  dplyr::select(Date, Emp_growth)

# Join series together, and set missing values to -999 (Stan doesn't like NAs)
full_data <- left_join(EMP, GDP, by = "Date") %>% 
  mutate(GDP_growth = ifelse(is.na(GDP_growth), -999, GDP_growth)) %>%
  filter(!is.na(Emp_growth))

Now we have our dataset, let’s write the model out in Stan. As you can see, this is very similar to writing out the underlying probability model on paper.

data {
  int T; // number of obs
  int P; // number of observed variables
  int frequency[P];
  matrix[T,P] Y; //dataset of generated series
}

parameters {
  vector[T] xhat; // state variable
  real<lower = 0> gamma_1; // First factor loading--restricted to positive for identificaton
  vector<lower = 0>[P-1] gamma_rest; // The rest of the factor loadings
  vector[1] theta; // AR(1) Coef on state series
  real<lower = 0> sigma_state; // The scale of innovations to the state
}
transformed parameters {
  vector[P] gamma;
  
  // Need to create a complete factor loading vector
  
  gamma[1] = gamma_1;
  gamma[2:P] = gamma_rest;
  
}

model {
  // priors
  xhat[1] ~ normal(0,1);
  sigma_state ~ normal(0, .3); 
  gamma ~ normal(0, 1);
  theta ~ normal(0, 1);

  // State Equation
  for(t in 2:T) {
    xhat[t] ~ normal(xhat[t-1]*theta[1],sigma_state);
  }

  // Measurement Equations
  for(t in 1:T) {
    for(p in 1:P) {
      if(Y[t,p] != -999) {
        int freq;
        freq = frequency[p] - 1;
        if(freq>1 && t >2) {
          Y[t,p] ~ normal(((freq + 1.0)^-1)*sum(xhat[(t-freq):t])*gamma[p],1);
        }
      } 
    }
  }
}
//Remember to add a blank line at the end of the Stan program

You will notice two tricks we use in implementing this model. First, factor models are not sign identified—the we could put a negative sign in front of the factor values and the factor loadings and the outcome would be the same. So we impose sign restrictions. Second, the measurement of GDP is actually a measurement of three periods of the factor, not one (the first time we put up this post, we didn’t use this specification—thanks to Tim’s comments below). And so we use a little trick so that the monthly signals we have

but for the quarterly-measured variables we have the average of the last three month’s factors.

Now we can estimate the model. This is done by calling it from R:

# Standardize columns (without -999s) and replace -999s
full_data_tmp <- full_data[,-1]
full_data_tmp[full_data_tmp==-999] <- NA
full_data_tmp <- scale(full_data_tmp)[,1:ncol(full_data_tmp)]
full_data_tmp[is.na(full_data_tmp)] <- -999

# Now run the model
dfm_model <- stan(file = "DFM_Mixed_Stan.stan", 
                  data = list(T = nrow(full_data),
                              P = ncol(full_data[,-1]),
                              frequency = c(1, 3),
                              Y = full_data_tmp), chains = 4)

Let’s have a look at the state estimates relative to the observed series.

summarised_state <- as.data.frame(dfm_model) %>% 
  select(contains("xhat")) %>%
  melt() %>% 
  group_by(variable) %>% 
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>% 
  mutate(Emp_growth = full_data_tmp[,1],
         GDP_growth = full_data_tmp[,2],
         GDP_growth = ifelse(GDP_growth == -999, NA, GDP_growth),
         Date = full_data$Date)
  
summarised_state %>% 
  ggplot(aes(x = Date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "orange", alpha = 0.3) +
  geom_line(aes(y = median)) +
  geom_line(aes(y = Emp_growth), colour = "red") +
  geom_point(aes(y = GDP_growth)) +
  annotate("text", x = as.Date("2000-01-31"), y = 2.5, label = "GDP growth deviations (points)") +
  annotate("text", x = as.Date("2000-01-31"), y = 3.8, label = "Employment growth deviations", colour = "red") +
  ggthemes::theme_economist() +
  ggtitle("Dynamic factor model state and credibility band")

And there you have it. We have extracted a common factor from two dynamic series. If you want to increase the number of series, simply add them to the Y matrix (with corresponding -999s for missing values).

Modern Statistical Workflow

Wednesday, August 31, 2016

A brief introduction to Dynamic Factor Models in Stan

David Stephan and Jim Savage

31 August 2016

Putting together some Australian data