Jim Savage
24 February 2018
Many time-series exhibit “regimes”—periods for which a set of rules seem to apply. When we jump to a different regime, a different set of rules apply. Yet we never really know which regime we’re in—they aren’t directly observable. This post illustrates the intuition behind building models that have these latent regimes in Stan. They are close in concept to finite mixture models as described here, with the main difference being that we model the changes in state probabilities as being a discrete (hidden) Markov process, rather than a continuous autoregressive process. The following explanation uses James Hamilton’s fantastic notes.
The generative process
As in the linked post above, we’ll use 100 times the daily difference in logged prices of a stock as the example time-series to model. Call these percentage returns . Just as before, we have two states: a random walk state, and a momentum state, (in which returns persist). I’ll assume normal distributions for these daily returns—you shouldn’t do this in real life. Just want to keep things simple. The generative models under the two states are:
State 1: (prices take a random walk)
State 2: (some momentum in returns)
We’ll call the true state in period . Of course, we don’t actually know what state we’re in on a given day, but we can infer probabilistically if we have a model for it. The particular way we’ll model this discrete state is as a Markov chain. That is,
—all the information relevant to making a forecast for what the state is in the next period is contained in what the state was the previous period.
It might help to illustrate the Markov matrix explicitly:
Strategy for fitting the model
The problem with fitting this model is that we don’t know what state we were in in the previous period for sure, which necessitates an iterative algorithm.
First, let’s define the likelihood contribution of an observation under the two states
In Stan notation, this is just
// inside a for loop over t
eta1[t] = exp(normal_lpdf(y[t] | alpha_1, sigma_1));
eta2[t] = exp(normal_lpdf(y[t] | alpha_2 + rho * y[t-1], sigma_2));
If we knew the state in the previous period for sure, we could evaluate the log likelihood contribution of as
That is, the likelihood contribution is simply the weighted average of likelihood contributions under the two proposed models, where the weights are the probability of being in each state.
Of course, we don’t know the probability of being in each state. Let be the probability of being in state in period , so is the probability of being in that state the previous period. This gives us the likelihood contribution
Again, this is just a weighted average of likelihood contributions, weighted now by the probability of being in each state, which itself is uncertain and therefore we have to average across the probability of being in each of those states last period.
But where does s come from? We can back out the probability of being in each state in by noticing that the probability of being in each state is
Once we have the likelihood contributions, we can calculate the log likelihood
Priors and identification
The big problem with modeling any mixture-type model is that if the two generative models are fairly similar, the relative likelihoods will be fairly similar and you can wind up with each MCMC chain mis-labeling the different models. Mike Betancourt has a wonderful primer on this here. In order to identify the model, we need prior information that makes each of the potential data generating processes distinguishable from each other. In the case of the model above, the case where makes the data generating processes the same, so we need a strong prior on this parameter in order to identify the model. You’ll know that your regime-switching model is poorly identified when your Rhats are larger than 1 but each chain is mixing well when considered individually.
We want the autoregressive model to be distinguishable from the white-noise model. So let’s go with a fairly tight prior on . If we relax this prior too much, we expect that will get dragged towards 0 and the the model will get less identifiable.
is the daily expected percentage point return, which should be unbounded. With 252 trading days a year, a would imply one-standard-deviation annual returns of 25%, which seems reasonable.
The innovation scales must be positive—we’ll use half Cauchy .
must be in (0, 1), and we expect considerable stickiness of states so we’ll use a .
Finally the initial values of we’ll use .
R and Stan code for implementation
First, let’s get the same data we used for the last example (note that the Quandl code has changed now that they want to make money).
library(tidyverse); library(rstan); library(Quandl)
# Now with real data!
googl <- Quandl("WIKI/GOOGL", collapse = "weekly")
googl <- googl %>%
mutate(Date = as.Date(Date)) %>%
arrange(Date) %>%
mutate(l_ac = log(`Adj. Close`),
dl_ac = c(NA, diff(l_ac))) %>%
filter(Date > "2010-01-01")
plot.ts(googl$dl_ac, main = "Weekly changes in Google's adjusted close price")
The Stan code for the model is below. Note that I’ve not used the
log_sum_exp
operator, but you could do this. Also you could use some matrix algebra to simplify the creation of f
and xi
in the transformed parameters
block—you’ll probably want to do this if you have more than one state.// saved as regime_switching_model.stan
data {
int T;
vector[T] y;
}
parameters {
vector<lower = 0, upper = 1>[2] p;
real<lower = 0> rho;
vector[2] alpha;
vector<lower = 0>[2] sigma;
real<lower = 0, upper = 1> xi1_init;
real y_tm1_init;
}
transformed parameters {
matrix[T, 2] eta;
matrix[T, 2] xi;
vector[T] f;
// fill in etas
for(t in 1:T) {
eta[t,1] = exp(normal_lpdf(y[t]| alpha[1], sigma[1]));
if(t==1) {
eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y_tm1_init, sigma[2]));
} else {
eta[t,2] = exp(normal_lpdf(y[t]| alpha[2] + rho * y[t-1], sigma[2]));
}
}
// work out likelihood contributions
for(t in 1:T) {
// for the first observation
if(t==1) {
f[t] = p[1]*xi1_init*eta[t,1] + // stay in state 1
(1 - p[1])*xi1_init*eta[t,2] + // transition from 1 to 2
p[2]*(1 - xi1_init)*eta[t,2] + // stay in state 2
(1 - p[2])*(1 - xi1_init)*eta[t,1]; // transition from 2 to 1
xi[t,1] = (p[1]*xi1_init*eta[t,1] +(1 - p[2])*(1 - xi1_init)*eta[t,1])/f[t];
xi[t,2] = 1.0 - xi[t,1];
} else {
// and for the rest
f[t] = p[1]*xi[t-1,1]*eta[t,1] + // stay in state 1
(1 - p[1])*xi[t-1,1]*eta[t,2] + // transition from 1 to 2
p[2]*xi[t-1,2]*eta[t,2] + // stay in state 2
(1 - p[2])*xi[t-1,2]*eta[t,1]; // transition from 2 to 1
// work out xi
xi[t,1] = (p[1]*xi[t-1,1]*eta[t,1] +(1 - p[2])*xi[t-1,2]*eta[t,1])/f[t];
// there are only two states so the probability of the other state is 1 - prob of the first
xi[t,2] = 1.0 - xi[t,1];
}
}
}
model {
// priors
p ~ beta(10, 2);
rho ~ normal(1, .1);
alpha ~ normal(0, .1);
sigma ~ cauchy(0, 1);
xi1_init ~ beta(2, 2);
y_tm1_init ~ normal(0, .1);
// likelihood is really easy here!
target += sum(log(f));
}
Next, we compile the model
options(mc.cores = parallel::detectCores())
compiled_model <- stan_model("regime_switching_model.stan")
googl_mod <- sampling(compiled_model, data= list(T = nrow(googl), y = googl$dl_ac*100), iter = 1000, chains = 4)
We can check that the model has converged (our Rhats should be around 1).
print(googl_mod, pars = c("alpha", "rho", "p", "sigma"))
## Inference for Stan model: regime_switching_model.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha[1] 0.06 0.00 0.08 -0.10 0.00 0.06 0.11 0.21 2000 1
## alpha[2] 0.01 0.00 0.10 -0.20 -0.06 0.01 0.07 0.21 2000 1
## rho 0.96 0.00 0.10 0.76 0.89 0.96 1.03 1.16 2000 1
## p[1] 0.96 0.00 0.02 0.92 0.95 0.96 0.97 0.99 2000 1
## p[2] 0.54 0.00 0.12 0.31 0.46 0.55 0.63 0.77 2000 1
## sigma[1] 2.85 0.00 0.13 2.59 2.77 2.85 2.94 3.10 2000 1
## sigma[2] 9.69 0.06 2.26 6.56 8.15 9.24 10.76 15.31 1672 1
##
## Samples were drawn using NUTS(diag_e) at Wed Feb 28 11:22:52 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
So what do the state probabilities look like? Let’s extract the first column of
xi
and plot it over time.goog <- as.data.frame(googl_mod, pars = "xi") %>%
gather(par, value) %>%
filter(grepl(",1", par)) %>%
mutate(time = readr::parse_number(stringr::str_extract(par, "[0-9]{1,3}[,]"))) %>%
group_by(par) %>%
summarise(time = first(time),
mean = mean(value),
lower = quantile(value, .025),
upper = quantile(value, .975)) %>%
mutate(date = readr::parse_date(googl$Date)) %>%
ggplot(aes(x = date, y = mean)) +
#geom_point(alpha = 0.3) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3, fill = "green") +
labs(title ="Probability of being in random walk state\n vsv momentum state",
y = "Probability",
subtitle = "GOOGL")
close <- ggplot(data = googl,aes(x = Date, y = `Adj. Close`)) +
geom_line(colour = "green")
gridExtra::grid.arrange(goog, close)
As we can see, there are relatively few episodes of genuine momentum activity, and the week-to-week persistence of momentum is only slightly above .5. So I’d not be too gung-ho about trading on this model just yet.
A possible extension
As it is, we’ve specified the Markov transition probabilities and as parameters. Yet we have a wealth of information that might provide information about how those probabilities might change. In Stan, it is extremely easy to model these parameters directly, for instance using
where is some unbounded function of the information available at the beginning of the week. But that’s for another day!