Wednesday, February 28, 2018

Regime-switching models in Stan

Many time-series exhibit “regimes”—periods for which a set of rules seem to apply. When we jump to a different regime, a different set of rules apply. Yet we never really know which regime we’re in—they aren’t directly observable. This post illustrates the intuition behind building models that have these latent regimes in Stan. They are close in concept to finite mixture models as described here, with the main difference being that we model the changes in state probabilities as being a discrete (hidden) Markov process, rather than a continuous autoregressive process. The following explanation uses James Hamilton’s fantastic notes.

The generative process

As in the linked post above, we’ll use 100 times the daily difference in logged prices of a stock as the example time-series to model. Call these percentage returns . Just as before, we have two states: a random walk state, and a momentum state, (in which returns persist). I’ll assume normal distributions for these daily returns—you shouldn’t do this in real life. Just want to keep things simple. The generative models under the two states are:
State 1:  (prices take a random walk)
State 2:  (some momentum in returns)
We’ll call the true state in period  . Of course, we don’t actually know what state we’re in on a given day, but we can infer probabilistically if we have a model for it. The particular way we’ll model this discrete state is as a Markov chain. That is, 

  
—all the information relevant to making a forecast for what the state is in the next period is contained in what the state was the previous period.
It might help to illustrate the Markov matrix explicitly:

Strategy for fitting the model

The problem with fitting this model is that we don’t know what state we were in in the previous period for sure, which necessitates an iterative algorithm.
First, let’s define the likelihood contribution of an observation  under the two states

In Stan notation, this is just
// inside a for loop over t
eta1[t] = exp(normal_lpdf(y[t] | alpha_1, sigma_1));

eta2[t] = exp(normal_lpdf(y[t] | alpha_2 + rho * y[t-1], sigma_2));
If we knew the state in the previous period for sure, we could evaluate the log likelihood contribution of  as

That is, the likelihood contribution is simply the weighted average of likelihood contributions under the two proposed models, where the weights are the probability of being in each state.
Of course, we don’t know the probability of being in each state. Let  be the probability of being in state  in period , so  is the probability of being in that state the previous period. This gives us the likelihood contribution

Again, this is just a weighted average of likelihood contributions, weighted now by the probability of being in each state, which itself is uncertain and therefore we have to average across the probability of being in each of those states last period.
But where does s come from? We can back out the probability of being in each state in t by noticing that the probability of being in each state is

Once we have the likelihood contributions, we can calculate the log likelihood

Priors and identification

The big problem with modeling any mixture-type model is that if the two generative models are fairly similar, the relative likelihoods will be fairly similar and you can wind up with each MCMC chain mis-labeling the different models. Mike Betancourt has a wonderful primer on this here. In order to identify the model, we need prior information that makes each of the potential data generating processes distinguishable from each other. In the case of the model above, the case where  makes the data generating processes the same, so we need a strong prior on this parameter in order to identify the model. You’ll know that your regime-switching model is poorly identified when your Rhats are larger than 1 but each chain is mixing well when considered individually.
We want the autoregressive model to be distinguishable from the white-noise model. So let’s go with a fairly tight prior on . If we relax this prior too much, we expect that  will get dragged towards 0 and the the model will get less identifiable.
 is the daily expected percentage point return, which should be unbounded. With 252 trading days a year, a  would imply one-standard-deviation annual returns of 25%, which seems reasonable.
The innovation scales must be positive—we’ll use half Cauchy .
 must be in (0, 1), and we expect considerable stickiness of states so we’ll use a .
Finally the initial values of  we’ll use .

R and Stan code for implementation

First, let’s get the same data we used for the last example (note that the Quandl code has changed now that they want to make money).

library(tidyverse); library(rstan); library(Quandl)
# Now with real data! 
googl <- Quandl("WIKI/GOOGL", collapse = "weekly")

googl <- googl %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Date) %>% 
  mutate(l_ac = log(`Adj. Close`),
         dl_ac = c(NA, diff(l_ac))) %>% 
  filter(Date > "2010-01-01")

plot.ts(googl$dl_ac, main = "Weekly changes in Google's adjusted close price")
The Stan code for the model is below. Note that I’ve not used the log_sum_exp operator, but you could do this. Also you could use some matrix algebra to simplify the creation of f and xi in the transformed parameters block—you’ll probably want to do this if you have more than one state.

// saved as regime_switching_model.stan
data {
  int T;
  vector[T] y;
}
parameters {
  vector<lower = 0, upper = 1>[2] p;
  real<lower = 0> rho;
  vector[2] alpha;
  vector<lower = 0>[2] sigma;
  real<lower = 0, upper = 1> xi1_init; 
  real y_tm1_init;
}
transformed parameters {
  matrix[T, 2] eta;
  matrix[T, 2] xi;
  vector[T] f;
  
  // fill in etas
  for(t in 1:T) {
    eta[t,1] = exp(normal_lpdf(y[t]| alpha[1], sigma[1]));
    if(t==1) {
      eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y_tm1_init, sigma[2]));
    } else {
      eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y[t-1], sigma[2]));
    }
  }
  
  // work out likelihood contributions
  
  for(t in 1:T) {
    // for the first observation
    if(t==1) {
      f[t] = p[1]*xi1_init*eta[t,1] + // stay in state 1
             (1 - p[1])*xi1_init*eta[t,2] + // transition from 1 to 2
             p[2]*(1 - xi1_init)*eta[t,2] + // stay in state 2 
             (1 - p[2])*(1 - xi1_init)*eta[t,1]; // transition from 2 to 1
      
      xi[t,1] = (p[1]*xi1_init*eta[t,1] +(1 - p[2])*(1 - xi1_init)*eta[t,1])/f[t];
      xi[t,2] = 1.0 - xi[t,1];
    
    } else {
    // and for the rest
      
      f[t] = p[1]*xi[t-1,1]*eta[t,1] + // stay in state 1
             (1 - p[1])*xi[t-1,1]*eta[t,2] + // transition from 1 to 2
             p[2]*xi[t-1,2]*eta[t,2] + // stay in state 2 
             (1 - p[2])*xi[t-1,2]*eta[t,1]; // transition from 2 to 1
      
      // work out xi
      
      xi[t,1] = (p[1]*xi[t-1,1]*eta[t,1] +(1 - p[2])*xi[t-1,2]*eta[t,1])/f[t];
      
      // there are only two states so the probability of the other state is 1 - prob of the first
      xi[t,2] = 1.0 - xi[t,1];
    }
  }
  
}
model {
  // priors
  p ~ beta(10, 2);
  rho ~ normal(1, .1);
  alpha ~ normal(0, .1);
  sigma ~ cauchy(0, 1);
  xi1_init ~ beta(2, 2);
  y_tm1_init ~ normal(0, .1);
  
  
  // likelihood is really easy here!
  target += sum(log(f));
}
Next, we compile the model

options(mc.cores = parallel::detectCores())
compiled_model <- stan_model("regime_switching_model.stan")

googl_mod <- sampling(compiled_model, data= list(T = nrow(googl), y = googl$dl_ac*100), iter = 1000, chains = 4)
We can check that the model has converged (our Rhats should be around 1).

print(googl_mod, pars = c("alpha", "rho", "p", "sigma"))
## Inference for Stan model: regime_switching_model.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##          mean se_mean   sd  2.5%   25%  50%   75% 97.5% n_eff Rhat
## alpha[1] 0.06    0.00 0.08 -0.10  0.00 0.06  0.11  0.21  2000    1
## alpha[2] 0.01    0.00 0.10 -0.20 -0.06 0.01  0.07  0.21  2000    1
## rho      0.96    0.00 0.10  0.76  0.89 0.96  1.03  1.16  2000    1
## p[1]     0.96    0.00 0.02  0.92  0.95 0.96  0.97  0.99  2000    1
## p[2]     0.54    0.00 0.12  0.31  0.46 0.55  0.63  0.77  2000    1
## sigma[1] 2.85    0.00 0.13  2.59  2.77 2.85  2.94  3.10  2000    1
## sigma[2] 9.69    0.06 2.26  6.56  8.15 9.24 10.76 15.31  1672    1
## 
## Samples were drawn using NUTS(diag_e) at Wed Feb 28 11:22:52 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

So what do the state probabilities look like? Let’s extract the first column of xi and plot it over time.

goog <- as.data.frame(googl_mod, pars = "xi") %>% 
  gather(par, value) %>% 
  filter(grepl(",1", par)) %>%
  mutate(time = readr::parse_number(stringr::str_extract(par, "[0-9]{1,3}[,]"))) %>% 
  group_by(par) %>% 
  summarise(time = first(time), 
            mean = mean(value),
            lower = quantile(value, .025),
            upper = quantile(value, .975)) %>% 
  mutate(date = readr::parse_date(googl$Date)) %>% 
  ggplot(aes(x = date, y = mean)) +
  #geom_point(alpha = 0.3) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3,  fill = "green") +
  labs(title ="Probability of being in random walk state\n vsv momentum state",
       y = "Probability",
       subtitle = "GOOGL")

close <- ggplot(data = googl,aes(x = Date, y = `Adj. Close`)) +
  geom_line(colour = "green")

gridExtra::grid.arrange(goog, close)
As we can see, there are relatively few episodes of genuine momentum activity, and the week-to-week persistence of momentum is only slightly above .5. So I’d not be too gung-ho about trading on this model just yet.

A possible extension

As it is, we’ve specified the Markov transition probabilities  and  as parameters. Yet we have a wealth of information that might provide information about how those probabilities might change. In Stan, it is extremely easy to model these parameters directly, for instance using

where  is some unbounded function of the information available at the beginning of the week. But that’s for another day!

9 comments:

  1. Thanks for this post Jim! I've been playing with it as a template for my own work and it has been very useful. I do have one question. Why is your alpha[1] value near zero? Shouldn't it be near 100 given that you added 100 to your data before sampling? It seems the alpha parameter results are just a summary of prior samples rather than the posterior density.

    ReplyDelete
    Replies
    1. Goo catch! The prior on alpha should clearly be wider; when I was writing the post I was using daily data, in which case a .1% daily return would be pretty good! Clearly it's way too tight for weekly changes.

      Delete
  2. Thanks for a this write up, very useful for a beginner like me. I have two short comments.

    1/ the data now contains more than 1000 observations so the regexp on the 4th line above should become "[0-9]{1,4}[,]".

    2/ More importantly, I was getting identification issues with my data, and one way of solving that was to add the constraint that impose that sigma[1] > sigma[2].

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
  4. I've never seen so much bs spam in comments on a stat blog post before.

    ReplyDelete
  5. if I have multiple states, how can I adjust/learn the transition probability by giving more weights for self-transition?

    ReplyDelete
  6. Hello friends, its fantastic article about teaching and entirely explained, keep it up all the time. 야한동영상

    Click this link
    야설

    ReplyDelete
  7. Thanks for sharing. I found a lot of interesting information here. A really good post, very thankful and hopeful that you will write many more posts like this one. 일본야동

    Click this link
    한국야동

    ReplyDelete
  8. Great Information sharing .. I am very happy to read this article .. thanks for giving us go through info. Fantastic nice. I appreciate this post. 국산야동

    Click this link
    야설

    ReplyDelete