Modern Statistical Workflow: A brief introduction to Dynamic Factor Models in Stan

David Stephan and Jim Savage

31 August 2016

Policymakers are forced to rely on many different partial indicators (such as retail trade figures or vacancy data) to try and form a picture of the current state of the economy. The difficulty faced when analysing partial data is what importance to place on a certain series, especially large movements. Are they a sign of a coming downturn, or are they merely inherent noise in the measurement of the series?

The process of combining all the series into a picture of the business cycle is quite difficult and often messy. Given that the business cycle is not entirely captured by any one variable (whether it be GDP, employment or share prices), but rather reflects movements in many variables, it would be beneficial to have an indicator that somehow provided a consistent and coherent method for combining the useful information provided by the partial indicators.

One technique for achieving this goal is to what are known as factor models. We can think of macroeconomic time series as incorporating a common component and an idiosyncratic component. Think of any macroeconomic indicator, such as employment growth:

where $f_{t}$ is the common factor with factor loading (or weight) $γ$ , and $ϵ_{t}$ is a an idiosyncratic error term, about which we need to give a distributional assumption. This same specification will hold for all the series we are interested in (and/or have data for):

There are a number of ways in which we could extract the common factor. In many studies, principal components is used. In our setting, especially if we wish to use a mixture of data frequencies (monthly, quarterly, etc.) it may be a simpler to instead use a state-space representation of our model. We retain the above equation as the signal or measurement equation of our system and allow a simple assumption for the evolution of the underlying state variable (the common factor):

where the common factor evolves as a simple autoregressive model with coefficient $θ$ , and $ω$ is a an idiosyncratic error term, which we assume is distributed normally with standard deviation $σ$ . It is easy to extend the structure of the model to have more than one common factor and for the factor itself to evolve as a higher order or a random walk.

Factor models such as these have been shown to provide competitive forecasts of major macroeconomic aggregates when compared with standard time-series models. There are several versions of such models in use at the NY FED and the ECB.

Putting together some Australian data

In the example we give below, we use standardized annual growth in Australian employment and GDP as our signals of economic activity. These can be easily pulled from Quandl from within R:

library(Quandl); library(dplyr); library(rstan); library(ggplot2); library(reshape2)
options(mc.cores = parallel::detectCores())

#real gdp millions chained dollars - seasonally adj
GDP <- Quandl("AUSBS/5206002_EXPENDITURE_VOLUME_MEASURES_A2304402X") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(GDP_growth = log(Value/lag(Value, 4))) %>% 
  dplyr::select(Date, GDP_growth)

#employed persons - monthly seasonally adj
EMP <- Quandl("AUSBS/6202001_A84423043C") %>% 
  mutate(Date = as.Date(Date)) %>% 
  arrange(Date) %>% 
  mutate(Emp_growth = log(Value/lag(Value, 12))) %>%
  dplyr::select(Date, Emp_growth)

# Join series together, and set missing values to -999 (Stan doesn't like NAs)
full_data <- left_join(EMP, GDP, by = "Date") %>% 
  mutate(GDP_growth = ifelse(is.na(GDP_growth), -999, GDP_growth)) %>%
  filter(!is.na(Emp_growth))

Now we have our dataset, let’s write the model out in Stan. As you can see, this is very similar to writing out the underlying probability model on paper.

data {
  int T; // number of obs
  int P; // number of observed variables
  int frequency[P];
  matrix[T,P] Y; //dataset of generated series
}

parameters {
  vector[T] xhat; // state variable
  real<lower = 0> gamma_1; // First factor loading--restricted to positive for identificaton
  vector<lower = 0>[P-1] gamma_rest; // The rest of the factor loadings
  vector[1] theta; // AR(1) Coef on state series
  real<lower = 0> sigma_state; // The scale of innovations to the state
}
transformed parameters {
  vector[P] gamma;
  
  // Need to create a complete factor loading vector
  
  gamma[1] = gamma_1;
  gamma[2:P] = gamma_rest;
  
}

model {
  // priors
  xhat[1] ~ normal(0,1);
  sigma_state ~ normal(0, .3); 
  gamma ~ normal(0, 1);
  theta ~ normal(0, 1);

  // State Equation
  for(t in 2:T) {
    xhat[t] ~ normal(xhat[t-1]*theta[1],sigma_state);
  }

  // Measurement Equations
  for(t in 1:T) {
    for(p in 1:P) {
      if(Y[t,p] != -999) {
        int freq;
        freq = frequency[p] - 1;
        if(freq>1 && t >2) {
          Y[t,p] ~ normal(((freq + 1.0)^-1)*sum(xhat[(t-freq):t])*gamma[p],1);
        }
      } 
    }
  }
}
//Remember to add a blank line at the end of the Stan program

You will notice two tricks we use in implementing this model. First, factor models are not sign identified—the we could put a negative sign in front of the factor values and the factor loadings and the outcome would be the same. So we impose sign restrictions. Second, the measurement of GDP is actually a measurement of three periods of the factor, not one (the first time we put up this post, we didn’t use this specification—thanks to Tim’s comments below). And so we use a little trick so that the monthly signals we have

but for the quarterly-measured variables we have the average of the last three month’s factors.

Now we can estimate the model. This is done by calling it from R:

# Standardize columns (without -999s) and replace -999s
full_data_tmp <- full_data[,-1]
full_data_tmp[full_data_tmp==-999] <- NA
full_data_tmp <- scale(full_data_tmp)[,1:ncol(full_data_tmp)]
full_data_tmp[is.na(full_data_tmp)] <- -999

# Now run the model
dfm_model <- stan(file = "DFM_Mixed_Stan.stan", 
                  data = list(T = nrow(full_data),
                              P = ncol(full_data[,-1]),
                              frequency = c(1, 3),
                              Y = full_data_tmp), chains = 4)

Let’s have a look at the state estimates relative to the observed series.

summarised_state <- as.data.frame(dfm_model) %>% 
  select(contains("xhat")) %>%
  melt() %>% 
  group_by(variable) %>% 
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>% 
  mutate(Emp_growth = full_data_tmp[,1],
         GDP_growth = full_data_tmp[,2],
         GDP_growth = ifelse(GDP_growth == -999, NA, GDP_growth),
         Date = full_data$Date)
  
summarised_state %>% 
  ggplot(aes(x = Date)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "orange", alpha = 0.3) +
  geom_line(aes(y = median)) +
  geom_line(aes(y = Emp_growth), colour = "red") +
  geom_point(aes(y = GDP_growth)) +
  annotate("text", x = as.Date("2000-01-31"), y = 2.5, label = "GDP growth deviations (points)") +
  annotate("text", x = as.Date("2000-01-31"), y = 3.8, label = "Employment growth deviations", colour = "red") +
  ggthemes::theme_economist() +
  ggtitle("Dynamic factor model state and credibility band")

And there you have it. We have extracted a common factor from two dynamic series. If you want to increase the number of series, simply add them to the Y matrix (with corresponding -999s for missing values).

8 comments:

TimSeptember 1, 2016 at 5:32 AM
Thanks for the post. Just one comment. The dynamic factor is a monthly variable, but GDP is a quarterly variable, which should be be a function of current and past values of the factor. Don't know much about Stan, but I assume your Stan code should look like

Y[t,p] ~ normal(xhat[t]*gamma[p],1);

for unemployment and

Y[t,p] ~ normal(1/3*(xhat[t]+xhat[t-1]+xhat[t-2])*gamma[p],1)

for GDP (I have used a linear approximation).

javageSeptember 1, 2016 at 6:01 AM
Hey Tim, thanks for your comment. I thought I'd escaped this issue by expressing everything as annual growth rates, but thinking about it some more I'm not so sure. Will fix it up!
TimSeptember 1, 2016 at 10:28 PM
This comment has been removed by the author.
TimSeptember 1, 2016 at 11:15 PM
To be honest I overlooked the fact that you are calculating year on year changes. I think the way you did it now is correct. Quarterly GDP, Y[t]^q=Y[t]^m+Y[t-1]^m+Y[t-2]^m, where Y[t]^m is the monthly flow of GDP. The year-on-year growth rate is

(Y[t]^q-Y[t-12]^q)/Y[t-12]^q=(Y[t]^m+Y[t-1]^m+Y[t-2]^m-Y[t-12]^m-Y[t-13]^m-Y[t-14]^m)/(Y[t-12]^m-Y[t-13]^m-Y[t-14]^m)

From we can get to an expression where the quarterly growth rate is a weighted average of the past three periods of year-on-year monthly growth rates. Approximate the weights by using 1/3, and we get (Y[t]^q-Y[t-12]^q)/Y[t-12]^q=1/3(f[t]+f[t-1]+f[t-2]).

Given the new specification perhaps it is better to have month-on-month growth rates for employment and quarter-on-quarter growth rates for GDP. One should detect turning points sooner, than the other, no?
UnknownOctober 12, 2018 at 8:47 AM
This comment has been removed by the author.
MuninDecember 20, 2018 at 8:20 AM
I just stumbled across your model. I've been wrangling in some ecologic time series data into it. I was able to pull the factor loadings out with a slight modification to your code:

# Factor loadings
summarised_factors <- as.data.frame(dfm_model) %>%
select(contains("gamma_")) %>%
melt() %>%
group_by(variable) %>%
summarise(median = median(value),
lower = quantile(value, 0.025),
upper = quantile(value, 0.975))

I was digging through your model code. Does this model allow for multiple trends or just a single trend? My guess is a single common trend.

Thanks for sharing this.
King AbbasiJune 12, 2021 at 8:47 AM
Choosing a real estate company is perhaps the single most important choice you can make when buying or selling a property in the U.K. The purchase or sale of a home is often the largest financial transaction the average person will make. Doesn't make sense to have a capable and experienced real estate firm as your side to guide you though what can be a complicated process, both legal and financial? 港区の高級賃貸

Wednesday, August 31, 2016

A brief introduction to Dynamic Factor Models in Stan

David Stephan and Jim Savage

31 August 2016

Putting together some Australian data

8 comments: