Saturday, October 22, 2016

Finite mixture model with time-varying probabilities - part 1

In my last post, I described a simple model in which each observation of our data could come from one of two densities. We estimated the parameters of both densities, along with the probability that the data came from each. While finite mixture models like that one are a useful learning aid, we often want richer models for applied work. In particular, we might want the probability of an observation coming from each density to vary across observations. This is the first of two posts dedicated to the topic. I gave a talk covering some of this a few days ago (best viewed in Safari).
For the sake of an example, consider this: the daily returns series of a stock has two states. In the first, the stock is ‘priced to perfection’, and so the price is an I(1) random walk (daily returns are mean stationary). In the second, there is momentum: daily returns have an AR(1) structure. Explicitly, for daily log returns $r_{t}$:

State 1: $r_{t} \sim \mathcal{N}(\alpha_{1}, \sigma_{1})$

State 2: $r_{t} \sim \mathcal{N}(\alpha_{2} + \rho_{2} r_{t-1}, \sigma_{2})$
When we observe a value of $r_{t}$, we don’t know for sure whether it came from the first or second model; that is precisely what we want to infer. For this, we need a model for the probability that an observation came from each state. Call the state indicator $z_{t} \in \{1, 2\}$, so we want $\mathrm{prob}(z_{t} = 1)$. One such model could be:

$$\mathrm{prob}(z_{t} = 1) = \mathrm{logit}^{-1}(\mu_{t})$$

with

$$\mu_{t} \sim \mathcal{N}(\alpha_{3} + \rho_{1}\mu_{t-1} + \beta X_{t}, \sigma_{3})$$

Here, $X_{t}$ is a function of the information available at the beginning of day $t$. If we had interesting information about sentiment, news, etc., it could go in here. For simplicity, let’s say $X_{t} = r_{t-1}$.
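To get a feel for how the latent state maps into a probability, here is a quick check in R (my addition, not from the original post); plogis() is base R’s inverse logit, the same transformation as arm::invlogit:

# The inverse logit squashes the latent state onto (0, 1)
plogis(c(-2, 0, 2))
# [1] 0.1192029 0.5000000 0.8807971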
Under this specification (and for a vector containing all parameters, $\theta$), we can specify the likelihood contribution of an observation. It is simply the weighted average of the likelihoods under each candidate data generating process, where the weights are the probabilities that the data come from each density:

$$p(r_{t}\mid \theta) = \mathrm{prob}(z_{t} = 1)\,\mathcal{N}(r_{t}\mid \alpha_{1}, \sigma_{1}) + \left(1 - \mathrm{prob}(z_{t} = 1)\right)\mathcal{N}(r_{t}\mid \alpha_{2} + \rho_{2} r_{t-1}, \sigma_{2})$$
As discussed in the last post, we work with log likelihoods, not likelihoods, which means we should use the log_sum_exp() function in Stan. The log likelihood contribution of a single point is then:
log_sum_exp(log(inv_logit(mu[t])) + normal_lpdf(r[t] | alpha[1], sigma[1]),
            log(1 - inv_logit(mu[t])) + normal_lpdf(r[t] | alpha[2] + rho[2] * r[t-1], sigma[2]))
Stan has recently added another function that performs the same calculation but makes writing it out a bit easier. For two log densities lp1, lp2 and a mixing probability theta, we have
log_mix(theta, lp1, lp2) = log_sum_exp(log(theta) + lp1,
                                       log(1-theta) + lp2)
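To convince ourselves that these give the same answer, here is a small R sketch (my addition, with made-up parameter values) that computes the log likelihood contribution of one observation both as the log of the weighted average of densities and via log-sum-exp:

# Hypothetical values for a single observation
r_t <- 0.01; r_tm1 <- 0.02
alpha <- c(-0.01, 0.015); rho2 <- 0.8; sigma <- c(0.05, 0.03)
theta <- plogis(0.3)  # prob of state 1, i.e. inv_logit(mu[t])

# Direct mixture of densities, then take the log
ll_direct <- log(theta * dnorm(r_t, alpha[1], sigma[1]) +
                 (1 - theta) * dnorm(r_t, alpha[2] + rho2 * r_tm1, sigma[2]))

# The same quantity via log-sum-exp, as Stan's log_mix() computes it
log_sum_exp <- function(x) max(x) + log(sum(exp(x - max(x))))
ll_lse <- log_sum_exp(c(log(theta) + dnorm(r_t, alpha[1], sigma[1], log = TRUE),
                        log(1 - theta) + dnorm(r_t, alpha[2] + rho2 * r_tm1, sigma[2], log = TRUE)))

all.equal(ll_direct, ll_lse)  # TRUE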

Writing out the model

The Stan code for the model is:
// saved as time_varying_finite_mixtures.stan
data {
  int T;
  vector[T] r;
}
parameters {
  vector[T] mu;
  vector[2] rho;
  real beta;
  vector<lower = 0>[3] sigma;
  vector[3] alpha; 
}
model {
  // priors 
  mu[1] ~ normal(0, .1);
  sigma ~ cauchy(0, 0.5);
  rho ~ normal(1, .1);
  beta ~ normal(.5, .25);
  alpha[1:2] ~ normal(0, 0.1);
  alpha[3] ~ normal(0, 1);

  // likelihood
  for(t in 2:T) {
    mu[t] ~ normal(alpha[3] + rho[1] * mu[t-1] + beta * r[t-1], sigma[3]);

    target += log_mix(inv_logit(mu[t]), 
                      normal_lpdf(r[t] | alpha[1], sigma[1]), 
                      normal_lpdf(r[t] | alpha[2] + rho[2] * r[t-1], sigma[2]));
  }
}

Recapturing ‘known unknowns’

As should be clear by now, I believe strongly that we should simulate from the model and make sure that we can recapture “known unknowns” before taking the model to real data. Below we simulate some fake data.
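(One small addition to the original code: fixing the RNG seed makes the simulation reproducible.)

set.seed(42)  # arbitrary seed, my addition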
# Set some fake parameters
alpha1 <- -0.01
alpha2 <- 0.015
rho1 <- 0.95
rho2 <- 0.8
beta <- 0.5

sigma1 <- 0.05
sigma2 <- 0.03
sigma3 <- 0.3
T <- 500
r <- rep(NA, T)
r[1] <- 0

mu <- rep(NA, T)
z <- rep(NA, T)
mu[1] <- 0
z[1] <- 1


# Simulate the data series
for(t in 2:T) {
  mu[t]  <- rho1 * mu[t-1] + beta*(r[t-1]) + rnorm(1, 0, sigma3)
  prob <- arm::invlogit(mu[t])
  z[t] <- sample(1:2, 1, prob = c(prob, 1-prob))
  
  if(z[t]==1) {
    # random walk state
    r[t] <- rnorm(1, alpha1, sigma1)
  } else {
    # momentum state
    r[t] <- rnorm(1, alpha2 + rho2*r[t-1], sigma2)
  }
}
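As a quick sanity check on the simulation (my addition), we can count how often each state was visited; with these parameters the series should spend time in both:

# State 1 is the random walk state, state 2 the momentum state
table(z[-1])
mean(z[-1] == 1)  # proportion of days in the random walk state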
You should plot your data before doing anything. Let’s take a look.
# Plot the returns
plot.ts(r)
# Plot the probability of the random walk state
plot.ts(arm::invlogit(mu))
Looks good! Now we compile and run the model.
library(rstan)  # stan_model() and sampling() come from rstan

compiled_model <- stan_model("time_varying_finite_mixtures.stan")

estimated_model <- sampling(compiled_model, data = list(r = r, T = T), cores = 4, chains = 4)
Now we inspect the parameter estimates, which should align with those in our data generating process.
print(estimated_model, pars = c("alpha", "rho", "sigma"))
## Inference for Stan model: time_varying_finite_mixtures.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## alpha[1] -0.02    0.00 0.00 -0.02 -0.02 -0.01 -0.01 -0.01   949 1.00
## alpha[2]  0.02    0.00 0.00  0.01  0.02  0.02  0.02  0.02  1862 1.00
## alpha[3]  0.01    0.00 0.04 -0.06 -0.01  0.01  0.03  0.09   285 1.02
## rho[1]    0.86    0.01 0.06  0.72  0.82  0.86  0.90  0.97    73 1.06
## rho[2]    0.82    0.00 0.04  0.74  0.79  0.82  0.85  0.91  1048 1.00
## sigma[1]  0.05    0.00 0.00  0.04  0.04  0.05  0.05  0.05  4000 1.00
## sigma[2]  0.03    0.00 0.00  0.02  0.03  0.03  0.03  0.03   766 1.00
## sigma[3]  0.23    0.04 0.13  0.06  0.11  0.22  0.31  0.53    13 1.35
## 
## Samples were drawn using NUTS(diag_e) at Sat Oct 22 18:33:58 2016.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
It seems that most of the parameters have been estimated quite cleanly: most of the Rhats are fairly close to 1, with the exception of the standard deviation of the updates to the latent series (which will be very weakly identified, given that we don't observe mu). We could fix this by adding better prior information to the model, for instance a tighter prior on sigma[3].
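Beyond eyeballing Rhat, it is worth checking the sampler diagnostics programmatically. A small sketch using rstan's get_sampler_params() (this check is my addition, not part of the original post):

# Divergent transitions per chain (ideally all zeros)
sampler_params <- rstan::get_sampler_params(estimated_model, inc_warmup = FALSE)
sapply(sampler_params, function(x) sum(x[, "divergent__"]))

# Largest Rhat across all parameters
max(summary(estimated_model)$summary[, "Rhat"], na.rm = TRUE)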

Taking the model to real data

Now that we know our program can recapture a known model, we can take it to some real data. In this case, we'll use the log differences in sequential adjusted closing prices of Apple's common stock. With Apple being such a large, well-researched (and highly liquid) stock, we should expect it to spend almost all of its time in the random walk state. Let's see what the data say!
# Now with real data! 
library(dplyr)  # the pipe and data-manipulation verbs below come from dplyr

aapl <- Quandl::Quandl("YAHOO/AAPL")

aapl <- aapl %>%
  mutate(Date = as.Date(Date)) %>%
  arrange(Date) %>% 
  mutate(l_ac = log(`Adjusted Close`),
         dl_ac = c(NA, diff(l_ac))) %>% 
  filter(Date > "2015-01-01")

aapl_mod <- sampling(compiled_model, data = list(T = nrow(aapl), r = aapl$dl_ac * 100))
Now check that the model has fit properly
shinystan::launch_shinystan(aapl_mod)
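If you prefer a quick non-interactive check (my addition), printing the main parameters works too:

print(aapl_mod, pars = c("alpha", "rho", "sigma"))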
And finally plot the probability of being in each state.
library(reshape2)  # melt() comes from reshape2
library(ggplot2)

plot1 <- aapl_mod %>% 
  as.data.frame() %>% 
  select(contains("mu")) %>%
  melt() %>% 
  group_by(variable) %>% 
  summarise(lower = quantile(value, 0.05), 
            median = median(value),
            upper = quantile(value, 0.95)) %>% 
  mutate(date = aapl$Date,
         ac = aapl$l_ac) %>%
  ggplot(aes(x = date)) + 
  geom_ribbon(aes(ymin = arm::invlogit(lower), ymax = arm::invlogit(upper)), fill= "orange", alpha = 0.4) +
  geom_line(aes(y = arm::invlogit(median))) +
  ggthemes::theme_economist() +
  xlab("Date") +
  ylab("Probability of random walk model")


plot2 <- aapl_mod %>% 
  as.data.frame() %>% 
  select(contains("mu")) %>%
  melt() %>% 
  group_by(variable) %>% 
  summarise(lower = quantile(value, 0.05), 
            median = median(value),
            upper = quantile(value, 0.95)) %>% 
  mutate(date = aapl$Date,
         ac = aapl$`Adjusted Close`) %>%
  ggplot(aes(x = date, y = ac)) +
  geom_line() +
  ggthemes::theme_economist() +
  xlab("Date") +
  ylab("Adjusted Close")

gridExtra::grid.arrange(plot1, plot2)
And there we go! As expected, Apple spends almost all of its time in the random walk state but, surprisingly, appears to have had a few periods of genuine (mainly negative) momentum.

Building up the model

The main problem with this model is that our latent state $\mu_{t}$ can only vary so much from period to period. That can delay the response to the appearance of a new state and slow the process of “flipping back” into the regular state. One way of getting around this is to use a discrete state with more flexibility in flipping between states. We'll explore this in the next post, on regime-switching models.