Modern Statistical Workflow: Building useful models for industry

This post is primarily aimed at those interested in building models for use in industry—in particular, finance/banking and consulting. There are many interesting domains for modeling outside these fields, but I know nothing about those. More than anything, this is a collection of modeling tips, tricks and lessons that I’ve gathered from the problems that we have come across at Lendable.

Some context. Lendable, the firm I work for, lends money to finance firms in sub-Saharan Africa. One of the things we’re interested in when we make loans is whether and how much our customers’ customers will repay on their (small) loans. Mobile payments are the primary method for paying back these loans, so there is typically a very good paper-trail of historical payments. We use this paper-trail to build models of borrower behaviour. All these models are implemented in Stan. It would be far harder to do so in other languages.

I’d like to cover a few aspects to this work. They are:

Model scoping. Models in industry are often used to generate multiple outputs for many parties. We need to model these jointly;
Aggregating data can throw away information, but can also throw away noise;
Your datapoints are probably not iid. In finance, using an iid assumptions where the DGP is not is a recipe for disaster;
Thinking generatively affects how you think about the model specification;
Treatments affect a distribution, not just its expected value.

Model scoping

A very typical setting in industry/government is that a business problem has many possible questions, and so the modeler builds a suitably general model that should be able to provide answers to these (many, possibly quite different) questions. What’s more, we want distinct questions to be answered by a model in a consistent way.

As an example, take the Lendable Risk Engine, a model I developed over the last couple of years. This is a model of repayments on micro-loans in Sub-Saharan Africa. Many of these loans have no repayment schedule (so-called Pay-as-you-go), with borrowers paying to “unlock” the asset (the underlying asset might be a solar panel, or water pump). When Lendable is looking to purchase a portfolio of these loans or make a collections recommendation, we might have many questions:

Which customers do we expect will repay the remaining amount of their loans?
What proportion of the remaining balance of a portfolio of loans do we expect will be paid?
What is the expected timing of these cash flows?
How correlated do I expect individual loan repayments to be a) today, b) in a stressed scenario?
How many defaults do I expect to occur? How much do these defaulting customers owe?
What impact on a) an individual, b) the portfolio do I expect will happen from some intervention/change in collections style?

We want the answers to these questions to agree. If I ask “what is the 10th percentile of predicted portfolio cashflows?” I want to know the number of defaults associated with that scenario. Consequently, we need to model the answers to our questions jointly.

So, how do we model our outcomes jointly?

A straightforward way of implementing a joint model is to model a vector of outcomes rather than a single outcome. For continuous-outcomes, Stan allows us to use multivariate normal or multivariate Student-t errors. If our vector of outcomes for person

i

at time

t

Y_{i, t}

and our vector of predictors is

X_{i, t}

, the model might be something like:

Where

A_{i}

is a vector of intercepts for customer

i

B_{i}

is their matrix of coefficients, and

Σ

is a covariance matrix, which we normally decompose into a scale vector

τ

and correlation matrix

Ω

such that

Σ = diag (τ) Ω diag (τ)

.. Implementing this model directly in Stan is possible and straightforward, but the sampling performance can be a bit slow. You’ll get a large speedup by taking a Cholesky factor of the covariance matrix

L = chol (Σ)

and defining

We do this in the transformed parameters block. In the model block, we then say:

What about when my outcome is constrained or not continuous?

We’ll talk about this more later, but there are two important rules of thumb when model building

Your model should place non-zero probabilistic weight on outcomes that do happen
Your model should not place weight on impossible outcomes.

If we have outcomes that are non-continuous, or constrained, a model with multivariate normal or Student-t errors will be severely mis-specified. So what should we do?

For constrained continuous outcomes

y_{1}

, we can create an variable

y_{2}

that is on the unconstrained scale.

For outcomes with a lower bound $a$ , we can define an unconstrained variable $y_{2} = l o g (y_{1} - a)$ or $y_{2} = \frac{(y_{1} - a)^{λ} - 1}{λ}$ for some parameter $λ$ (a Box-Cox transformation).

If the variable with an upper bound $b$ , we make the transformation as $y_{2} = l o g (b - y_{1})$ or $y_{2} = \frac{(b - y_{1})^{λ} - 1}{λ}$
For a series $y$ constrained between a lower value $a$ and upper value $b$ , we can create a new unconstrained variable $y_{2} = l o g i t (\frac{y_{1} - a}{b - a})$ .

Once we have converted our constrained variable

y_{1}

into an unconstrained variable

y_{2}

, we can include it in the outcome vector and model as above.

A similar issue occurs when we we have a combination of continuous and binary or categorical outcomes. This is where the magic of Stan really comes to the fore. We can build transformed parameters that combine known data and unknown parameters into a single object, and then specify the joint probability of that object. An example might be illustrative.

Say we have a vector of continuous outcomes for period

t

Y_{t}

(maybe you can think of this as being a vector of macroeconomic aggregates), and a single binary outcome

d_{t}

(maybe an NBER recession period). Unfortunately we can’t include

d_{t}

in a joint continuous model of

Y_{t}

as joint distributions of discrete and continuous variables don’t exist in Stan. But what we can do is create a continuous parameter

τ_{t}

and append this to our continuous outcomes

Y_{t}

;

τ_{t}

then maps to the discrete outcome

d_{t}

via a probit link function. That is:

// your data declaration here
parameters {
  vector[rows(Y)] tau;
}
transformed parameters {
  matrix[rows(Y), cols(Y)+1] Y2;
  // fill in Y
  Y2[1:rows(Y), 1:cols(Y)] = Y;
  // fill in tau
  Y2[1:rows(Y), cols(Y)+1] = tau;
}
model {
  // priors
  // ...
  // our joint model of Y2 (you should probably optimize this)
  Y2 ~ multi_normal(A + BY_{t}, some_covariance);
  
  // the model of the binary outcome
  d ~ bernoulli(Phi_approx(col(Y2, cols(Y) + 1)));
}

By specifying the model this way, working out the conditional probability of an NBER-defined recession is simply a matter of simulating forward the entire model for each MCMC draw. This will incorporate uncertainty in the parameters, as well as the uncertainty in the future values of the predictors of a recession (provided a VAR(1) structure is correct—it probably isn’t).

Aggregating data

The big advantage of working in industry is the access to incredibly granular data. For instance, the basic unit of data that we receive is a transaction, which is time-stamped. This is fake data, but it might look like this:

date	transaction_amount	customer_id	transaction_type
2016-08-11 05:05:29	102932	11613	pmt
2016-08-14 20:50:29	102902	10277	pmt
2016-08-25 12:06:52	115495	10613	refund
2016-08-31 19:40:58	108238	104	pmt
2016-09-06 03:19:09	80990	66521	pmt

These are incredibly rich data, and are quite typical of what is available in retail or finance. The problem with it is that they contains a lot of noise.

The timing between payments might be meaningful, but often contains no information about future payments.
The precise payment amount on a day might be meaningful, but if customers are invoiced weekly or monthly, we might be missing something. Many customers tend to pay in chunks (several small payments) rather than the monthly due amount, especially in frontier markets where many wages are paid daily.

Because of the noisiness of very granular data, we might more easily extract the signal by aggregating our data. But how much should we aggregate? Should we sum up all payments for each minute (across customers)? For each day or month? Or should we sum within customers across time periods? Of course, your problem will dictate how you might go about this.

One particularly good example of the power of aggregation is in financial markets. In the short run (but not the very short run) equity prices behave as though they’re very close to random walk series. Under this interpretation, the only way you’re going to beat the market is by getting superior information, or perhaps trading against “dumb money” in the market. Yet in the long run, there is an amazing amount of structure in equity prices—in particular, a long-run cointegrating relationship between earnings and prices. This famous observation comes from John Cochrane’s 1994 paper.

In the case of Lendable, we found very early on that aggregating beyond the customer-level was extremely dangerous. In particular, customers tend to “atrophy” in their repayment behaviour as time passes. Customers who’ve been with the firm for longer are more likely to miss payments and default than customers who’ve just signed on. At the same time, the firms we lend to are growing incredibly fast. This means that at any time, there are many new, enthusiastic customers in a portfolio, lowering aggregate loss rates. But these aggregate loss rates don’t actually give much of an insight into the economic viability of the lender, hence the need to model individual repayments.

Your data are not IID. Stan can help

When Lendable started our modeling of individual repayments for debt portfolios, we were using more traditional machine learning tools: trees, nets, regularized GLMs etc. We made the decision to switch over to Stan after realising that modeling individual repayments were quite highly correlated with another. Take two super-simple data generating processes.

The first assumes that shocks to repayment are uncorrelated across people

y 1, i, t = μ + ϵ i, t

The second assumes that every person receives a “time shock” and an idiosyncratic shock

y 2, i, t = μ t + ϵ i, t

with

μ t \sim N (μ, σ μ)

and

ϵ i, t \sim N (0, σ ϵ)

If there are N customers, what is the variance of their cumulative payments under these two specifications?

Within a period, repayments under the first process are distributed

Y 1, t = \sum i = 1 N y 1, i, t \sim N (N μ, N σ 2 ϵ ‾ ‾ ‾ ‾ ‾ \sqrt)

which gives the aggregate payments between

t = 0

and

T

Y 1 = \sum t = 0 T Y 1, t \sim N (T \times N \times μ, T \times N \times σ 2 ϵ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ \sqrt)

Now how about under the second DGP, in which repayments are correlated? Now we have

Y 2 = \sum t = 0 T Y 2, t \sim N (T \times N \times μ, T \times N \times (σ 2 ϵ + σ 2 μ) ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ \sqrt)

That is, under the first DGP, our uncertainty around future repayments is smaller than under the second DGP, by the ratio of the scales

T \times N \times σ 2 ϵ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ \sqrt T \times N \times ( σ 2 ϵ + σ 2 μ ) ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ ‾ \sqrt

A back of the envelope rule of thumb we use is that the scale of portfolio shocks is roughly half the sale of individual shocks, meaning that methods which do not account for the correlation massively under-count observed volatility.

σ2μ
σ2μ

Now the models we use at Lendable aren’t quite as simple as this example, but we adhere to the general gist, modeling portfolio shocks independently from idiosyncratic shocks. The problem we found before migrating to Stan was that essentially all black-box machine learning techniques assume conditional iid data (a la the first model), and so our predictive densities were far too confident. If you’re making pricing decisions based on predictive densities, this is highly undesirable—you are not pricing in known risks, let alone unknown risks.

After realising that essentially no typical machine learning methods were able to take into account the sorts of known risks we wanted to model, we started looking for solutions. One early contender was the lme4 package in R, which estimates basic mixture models (like model 2 above) using an empirical-Bayes-flavoured approach. The issue with this was that as the dimension of the random effects gets large, it tends to underestimate

σ_{μ}

, which kind of defeats the purpose.

The other issue was that standard mixture models assume that the data are conditionally distributed according to some nice, parametric distribution. When you’re modeling loan repayments, this is not what your data look like.

Thinking generatively affects how you think about the model specification

The mindset for many in the “data science” scene, especially in finance, is “how can I use machine learning tools to discover structure in my data?” I’d caution against this approach, advocating instead a mindset of “what structure might give rise to the data that I observe?” Lendable’s experience in modeling small loans in frontier markets suggests that this latter approach can yield very good predictions, interpretable results, and a mature approach to causal inference.

When building a generative model, it’s wise to first plot the data you want to model. For instance, weekly repayments on loans might have a density looks like the black line in the following chart:

You can see that it is a multi-modal series. There’s a clear spike at 0—these loans have no fixed repayment schedule, so it’s quite common for people to pre-pay a big chunk in one month, then not pay in future months, or to not pay during periods in which they don’t have much money (say, between harvests). There are also two different modes, and a very long right hand tail.

When you look at a density like this, you can be pretty sure that it’ll be impossible to find covariates such that the residuals are normal. Instead, you have to do some deeper generativist reasoning. What this density suggests is that there are people paying zero (for whatever reason), some others paying at the modes—these folks are actually paying the “due amount”—and some other people paying much more than the due amount.

One extremely useful technique is to decompose the problem into two nested problems: the first is a model of which group each belongs to, the second is their predicted payment conditional on the group they’re in. This isn’t my idea; Ben Goodrich taught it to me, and it has been written about in Rachael Meager’s excellent job-market paper.

So how do you go about implementing such a model in Stan? The basic idea is that you have two dependent variables: one tells us the “class” that the observation belongs to (expressed as an integer), and the other the actual outcome.

data {
  // ...
  int class[N];
  vector[N] outcome;
}
//... 
model {
  // priors here
  
  class ~ categorical(softmax(mean_vector));
  
  for(n in 1:N) {
      if(class[n]==1) {
        outcome[n] ~ my_model_for_class_1();
      } else if(class[n]==2) {
        outcome[n] ~ my_model_for_class_2();
      }
      // ... 
  }
}

Of course we could implement something like this outside of Stan, by modeling classes and conditional outcomes separately, but if we’re using random effects in each model and want to estimate their correlation (yes, we want to do both of these things!) we have to use Stan.

Treatments affect a distribution, not just its expected value

A big part of the modeling in industry is evaluating the impact of interventions using randomized control trials (in industry, known as A/B tests). Sadly, it is common to use very basic estimates of causal impacts from these tests (like difference in means, perhaps “controlling” for covariates). These commonly used techniques give away a lot of information. A treatment might affect the expected value of the outcome, but is that all we’re interested in?

Using the example of small loans repayments, what if the impact of a lender’s call-center management impacts repayments, but mainly by changing the number of people paying zero in a given month. Or perhaps the treatment affects customers’ probability of making a large payment? Or maybe it convinces those who were going to make a large payment already to make a larger payment? Building a generative model like the one discussed above and including an experimental treatment in each sub-model allows us to not only get more precise estimates of the treatment effect, but also to understand how the causal impact of a treatment affects the outcome.

Conclusion

At Lendable we care very much about making investment decisions given uncertainty. Models will tend to not include risks that haven’t taken place in the historical data. But we see models all the time that don’t take into account the information we have in the data, let alone any of these Black Swans. Stan really makes building models with this property quite easy. If you’re keen on starting to learn, then get in touch, and I’ll give you a project!

I have many.

1 comment:

jwdinkJanuary 18, 2017 at 2:51 PM
Re: constrained outcomes when modeling joint distributions:

What if one of the outcomes you're modeling is constrained due to censoring and/or truncation (e.g., time-to-default as a right-censored outcome variable)? Am I correct in understanding that there is (currently) no workaround in Stan that allows you to model an outcome variable like this jointly with other outcome variables?

Modern Statistical Workflow

Tuesday, January 17, 2017

Building useful models for industry—some tips

Jim Savage

January 2017