Thursday, January 19, 2017

Jeff Alstott

This overview of prediction was Part 1 of a larger report on forecasting technology, which is available here. A PDF of this overview is here and the rest of the report is in review.

Let us define what we mean here by “prediction.” To predict something is to say that a specific thing will happen. Examples of prediction in technology include:

1. A computer will beat the best human player in chess by the year 2000
Ray Kurzweil, 1990 [1]
(The computer Deep Blue beat world champion Garry Kasparov in 1997)
2. Mail will be delivered between New York and Australia by guided missiles by 1969
Arthur Summerfield, US Postmaster General, 1959 [2]
(This didn’t happen)
3. 25% of all cars will be electric by 2040
Bloomberg New Energy Finance report, 2016 [3]
(Outcome pending)

These predictions are specific and concrete enough to be objectively verifiable. Statements about the future can be used for many things, such as inspiring people or helping them make decisions [4], but here we are interested in predictions for their ability to be objectively tested by comparing them to the real world. If the prediction matches the real world, the prediction was correct, and if it doesn’t, the prediction was incorrect. Correct predictions may then be used for other purposes like making decisions or inspiration, but their usefulness will stem from their accuracy, not other factors.

Most predictions, particularly successful predictions, are based on a model of the real world. For example, if someone asks you if juice made of apples would taste good, you probably first imagine the juice, and then answer the question. That imagined food is a model of the world, in your head. The model has the components of the juice, how they taste in your mouth, and what you yourself think of that taste. We make models like this all the time; whenever we imagine, plan, forecast, or predict, we are typically using models. These models might be implicit, but they’re there.

What’s Inside a Model

Causes: Not All of Them

Sometimes, however, those hidden chains of causality are not persistent, and so they start to matter. For example, “vinegar tastes sour” is a good predictive model, for the same reason that “apple juice tastes sweet” is a good predictive model: the causes underneath that statement are persistent. But if you eat the fruit of Synsepalum dulcificum, often called “miracle fruit,” then our model for vinegar breaks. Miracle fruit contains the molecule miraculin, which binds to the taste receptors that typically bind to sugar. Miraculin stays bound to those receptors, and if you then eat vinegar, a miracle occurs: the normally sour vinegar tastes sweet. The miraculin molecule interacts with the molecules of the vinegar (or any sour food) in a way which causes the sugar-sensitive taste receptor to activate, beginning the chain of causes that lead your mind to experience “tastes sweet.” Your model of “vinegar tastes sour” now predicts the wrong thing, because it did not include the relevant causes.

Essentially all models of the physical world have hidden causality. The process of finding these hidden causes is frequently called “science.” Science involves proposing new models of the world (hypotheses) that yield different predictions from our old models, then testing to see if those new predictions are correct (experiments). For example, the old model of “vinegar tastes sour, regardless of having eaten a miracle fruit” would get supplanted by “vinegar tastes sour typically, but tastes sweet after having eaten a miracle fruit,” because the second model gave better predictions. Science could then go deeper to understand what the causes were that led vinegar to taste sweet with the miracle fruit, finding previously hidden mechanisms like taste receptors, molecules, etc. The same inquiry could be used to understand why vinegar didn’t taste sweet without the miracle fruit! Through the scientific process we can get ever more accurate understandings of the causes in the world, which allow us to create models that give us ever more accurate predictions.

Predictors: Maybe Causes, Maybe Not

Consider two bottles, one labeled “apple juice” and one labeled “vinegar.” You have not eaten a miracle fruit, and none are around. Which bottle’s contents will taste sweet, and which will taste sour? Why? In this case, drinking the bottle labeled “apple juice” will lead you to taste sourness, and drinking the bottle labeled “vinegar” will lead you to taste sweetness. The reason, of course, is the bottle labeled “apple juice” actually contained vinegar, and the bottle labeled “vinegar” actually contained apple juice. You know why changing the labels did not cause a change in taste: bottle labels do not cause taste (unless you eat the label). But even knowing this, you may have used the bottle label to predict the taste. You may have a model that says bottles are labeled “apple juice” and their contents taste sweet due to the same common cause: actually containing apple juice. Seeing a bottle labeled “apple juice” led you to infer that it contained apple juice, and thus that the contents were sweet. You would not have thought that the bottle label caused the taste, but you would have thought it predicted the taste. Thus, the bottle label was a predictor with a predictive relationship with the thing to be predicted. In contrast, “drinking apple juice causes a sensation of sweet taste” considers the apple juice also as a predictor, but it has an explanatory or causal relationship with the thing predicted [5].

Predictive relationships can be valuable because they can, of course, lead to correct predictions. These predictions are not guaranteed, however, since other forces can get in the way and break the apparent relationship between a predictor and the predicted thing, as happened with the bottle labels. This most notably happens if we confuse a predictor for a cause and assume that manipulating that predictor ourselves will change the outcome. Making that mistake might lead us to write “apple juice” on a bottle and pour water into it, then wonder why the liquid inside doesn’t taste sweet. Predictors are very useful, but we should not assume they are necessarily causes.

Associational Models: Making Acausal Predictions and Inspiring Causes

Not all models try to understand causes; sometimes we construct models solely for the purpose of prediction. In these prediction-focused models there may be no explicit causal structure, but simply associations between one or more variables and the thing to predict. These are acausal, associational models. For example, you may have noticed that all bottles labeled “apple juice” that you ever encountered had contents that tasted sweet, and use that fact alone to predict that any bottle in the future labeled “apple juice” will have contents that taste sweet. The current exemplar for acausal, associational models is called “machine learning” [6]; machine learning techniques essentially transform many kinds of data in increasingly sophisticated ways to find associations between the thing to be predicted (like your risk of a car accident) and many possible predictor variables (age, income, education, car, etc.). In technology development, a long-used predictive model is trend extrapolation. In trend extrapolation the thing to be predicted is the change in performance of a technology (like computer performance) and the predictor variable is the amount of time. Predictive models like these are then tested by predicting on new data (such as data from the future), to test if the associations in the original data that the model was trained on continue to hold for new data. If the associations continue to hold, predictive models can make good predictions for new parts of the world, which is their purpose.
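As a sketch of trend extrapolation, the following fits a log-linear trend to a series of performance figures and extrapolates it forward. The numbers are invented for illustration; they are not data from the report.

```python
import numpy as np

# Hypothetical performance figures roughly quadrupling every two years.
years = np.array([2000, 2002, 2004, 2006, 2008, 2010])
performance = np.array([1.0, 4.1, 15.8, 65.0, 250.0, 1030.0])

# Fit log10(performance) = a + b * year by ordinary least squares.
b, a = np.polyfit(years, np.log10(performance), deg=1)

# Extrapolate the association to a year outside the training data.
predicted_2012 = 10 ** (a + b * 2012)
print(b, predicted_2012)
```

The predictor here is simply time; whether the extrapolation holds depends entirely on whether the underlying association persists into the future.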

Calibration

All participants are rewarded for using well-calibrated models; they will state their confidence at the level they think they should be confident. If they fall out of calibration, they will either go bankrupt or leave winnings on the table.
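Calibration can be checked empirically by comparing stated confidence with observed frequency. The sketch below simulates a forecaster who states 70% confidence on events that genuinely occur 70% of the time; all data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A well-calibrated forecaster: 70% stated confidence on 10,000 events
# that actually occur about 70% of the time.
stated_confidence = 0.7
occurred = rng.random(10_000) < stated_confidence

# The gap between stated confidence and empirical frequency.
calibration_gap = abs(occurred.mean() - stated_confidence)
print(calibration_gap)
```

For a poorly calibrated forecaster (say, one stating 90% on the same events) this gap would be large, and in a market such a forecaster would lose money to better-calibrated participants.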

Aggregation

In a prediction market virtually any set of models can compete against each other. Prediction markets are typically thought of as integrating the opinions of humans, which are frequently implicit models. However, those humans can also rely on the predictions of explicit models, which may be computational models structured completely differently from the humans’ mental models. A significant contribution to prediction markets is the human judgment about which explicit models are useful and how much to believe them. Furthermore, in an open prediction market, models from many different domains can potentially be brought in. Even if the organizers of the market are only considering human predictors from one area, like academia, people from other areas, like industry, can potentially enter and reap rewards.

Prediction markets are an aggregation technique that gives more weight to well-calibrated and accurate models, which has proven to be very general and powerful. However, they are not a predictive model themselves. Prediction markets still leave room for improving the object-level models being aggregated, which improve the performance of the whole market. For example, a well-calibrated market can predict an event will occur with 90% probability, and such events will actually occur 90% of the time. However, if the individual members of the market were more knowledgeable, the market could be more precise: several events that were all previously trading at 90% could split into two groups trading at 99% and 1%. The market could still be well-calibrated: the events traded at 99% could still occur 99% of the time, and the events traded at 1% could still occur 1% of the time, even though previously they were all trading at 90%, with good calibration. This is equivalent to narrowing the confidence intervals of a prediction; the predictions become more precise. Prediction markets are great at achieving calibration, but they are only as accurate as the population of models inside them allows them to be. Thus, even if one is using prediction markets to aggregate predictions, there is still utility to developing and disseminating better object-level models.
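The point about precision can be made concrete with a quick simulation: both markets below are well calibrated, but the sharper one scores better under a proper scoring rule such as the Brier score. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Coarse market: every event trades at 90% and occurs 90% of the time.
coarse_forecast = np.full(n, 0.9)
coarse_outcome = rng.random(n) < 0.9

# Sharper market: the same overall 90% event rate, now split into groups
# trading at 99% and 1% (the split is chosen to keep the base rate at 90%).
p_high = 0.89 / 0.98
high_group = rng.random(n) < p_high
sharp_forecast = np.where(high_group, 0.99, 0.01)
sharp_outcome = rng.random(n) < sharp_forecast

def brier(forecast, outcome):
    # Mean squared error of probabilistic forecasts; lower is better.
    return np.mean((forecast - outcome.astype(float)) ** 2)

print(brier(coarse_forecast, coarse_outcome),
      brier(sharp_forecast, sharp_outcome))
```

Both markets pass a calibration check, but the second makes much better use of the available information, which is exactly the improvement better object-level models deliver.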

1. Ray Kurzweil, The Age of Intelligent Machines (The MIT Press, 1990).
2. David William English, The Air Up There (McGraw Hill Professional, 2003).
3. “New Energy Outlook 2016” (Bloomberg New Energy Finance, June 2016).
4. Persistent Forecasting of Disruptive Technologies (Washington, D.C.: National Academies Press, 2009), http://www.nap.edu/catalog/12557.
5. Galit Shmueli, “To Explain or to Predict?” Statist. Sci. 25, no. 3 (August 2010): 289–310, doi:10.1214/10-STS330.
6. Machine learning includes such disparate methods as artificial neural networks, random forests, and support vector machines. They all transform and associate data without assuming any particular causal structure.
7. M. Mitchell Waldrop, “The Chips Are Down for Moore’s Law,” Nature 530, no. 7589 (February 9, 2016): 144–47, doi:10.1038/530144a.
8. “The Cost of Sequencing a Human Genome,” July 6, 2016, https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/.
9. Randall Munroe, “Xkcd: Correlation,” March 6, 2009, https://xkcd.com/552/.
10. A. Philip Dawid, “The Well-Calibrated Bayesian,” Journal of the American Statistical Association 77, no. 379 (1982): 605–10, http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856.
11. This whole process is what happens in Bayesian statistics.
12. Shannon R. Fye et al., “An Examination of Factors Affecting Accuracy in Technology Forecasts,” Technological Forecasting and Social Change 80, no. 6 (July 2013): 1222–31, doi:10.1016/j.techfore.2012.10.026; Carie Mullins, “Retrospective Analysis of Technology Forecasting: In-Scope Extension,” August 13, 2012.
13. J. Doyne Farmer and François Lafond, “How Predictable Is Technological Progress?” Research Policy 45, no. 3 (April 2016): 647–65, doi:10.1016/j.respol.2015.11.001.
14. Stephen Thornton, “Karl Popper,” in The Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta, Winter 2015, 2015, http://plato.stanford.edu/archives/win2015/entries/popper/.
15. Alan Musgrave and Charles Pigden, “Imre Lakatos,” in The Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta, Summer 2016, 2016, http://plato.stanford.edu/archives/sum2016/entries/lakatos/.
16. There are many.
17. The data sets most associated with science are randomized control trials, which aim to control for all possible factors by randomly assigning objects to be tested in some way. This works great if you can readily place certain properties on objects, like taking an apple and making it ripe or unripe. However, this isn’t always so easy, so other kinds of data construction are also useful tools.
18. Or at least shares a common cause with the apple juice taste.
19. George E. P. Box and Norman R. Draper, Empirical Model-Building and Response Surfaces, 1 edition (New York: Wiley, 1987).
20. Philip E. Tetlock and Dan Gardner, Superforecasting: The Art and Science of Prediction (Crown, 2015).
21. K. B. Laskey, R. Hanson, and C. Twardy, “Combinatorial Prediction Markets for Fusing Information from Distributed Experts and Models,” in 18th International Conference on Information Fusion, July 2015, 1892–1898.
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Tuesday, January 17, 2017

Building useful models for industry—some tips

This post is primarily aimed at those interested in building models for use in industry—in particular, finance/banking and consulting. There are many interesting domains for modeling outside these fields, but I know nothing about those. More than anything, this is a collection of modeling tips, tricks and lessons that I’ve gathered from the problems that we have come across at Lendable.
Some context. Lendable, the firm I work for, lends money to finance firms in sub-Saharan Africa. One of the things we’re interested in when we make loans is whether and how much our customers’ customers will repay on their (small) loans. Mobile payments are the primary method for paying back these loans, so there is typically a very good paper-trail of historical payments. We use this paper-trail to build models of borrower behaviour. All these models are implemented in Stan. It would be far harder to do so in other languages.
I’d like to cover a few aspects to this work. They are:
1. Model scoping. Models in industry are often used to generate multiple outputs for many parties. We need to model these jointly;
2. Aggregating data can throw away information, but can also throw away noise;
3. Your datapoints are probably not iid. In finance, using an iid assumption where the DGP is not iid is a recipe for disaster;
4. Thinking generatively affects how you think about the model specification;
5. Treatments affect a distribution, not just its expected value.

Model scoping

A very typical setting in industry/government is that a business problem has many possible questions, and so the modeler builds a suitably general model that should be able to provide answers to these (many, possibly quite different) questions. What’s more, we want distinct questions to be answered by a model in a consistent way.
As an example, take the Lendable Risk Engine, a model I developed over the last couple of years. This is a model of repayments on micro-loans in Sub-Saharan Africa. Many of these loans have no repayment schedule (so-called Pay-as-you-go), with borrowers paying to “unlock” the asset (the underlying asset might be a solar panel, or water pump). When Lendable is looking to purchase a portfolio of these loans or make a collections recommendation, we might have many questions:
• Which customers do we expect will repay the remaining amount of their loans?
• What proportion of the remaining balance of a portfolio of loans do we expect will be paid?
• What is the expected timing of these cash flows?
• How correlated do I expect individual loan repayments to be a) today, b) in a stressed scenario?
• How many defaults do I expect to occur? How much do these defaulting customers owe?
• What impact on a) an individual, b) the portfolio do I expect will happen from some intervention/change in collections style?
We want the answers to these questions to agree. If I ask “what is the 10th percentile of predicted portfolio cashflows?” I want to know the number of defaults associated with that scenario. Consequently, we need to model the answers to our questions jointly.

So, how do we model our outcomes jointly?

A straightforward way of implementing a joint model is to model a vector of outcomes rather than a single outcome. For continuous outcomes, Stan allows us to use multivariate normal or multivariate Student-t errors. If our vector of outcomes for person n at time t is Y[n, t] and our vector of predictors is X[n, t], the model might be something like:

Y[n, t] ~ multi_normal(A[n] + B[n] * X[n, t], Sigma)

Where A[n] is a vector of intercepts for customer n, B[n] is their matrix of coefficients, and Sigma is a covariance matrix, which we normally decompose into a scale vector sigma and correlation matrix Omega such that Sigma = diag(sigma) * Omega * diag(sigma). Implementing this model directly in Stan is possible and straightforward, but the sampling performance can be a bit slow. You’ll get a large speedup by taking a Cholesky factor of the correlation matrix, Omega = L_Omega * L_Omega', and defining

L_Sigma = diag_pre_multiply(sigma, L_Omega)

We do this in the transformed parameters block. In the model block, we then say:

Y[n, t] ~ multi_normal_cholesky(A[n] + B[n] * X[n, t], L_Sigma)
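The algebra behind that speedup can be checked numerically. This NumPy sketch uses stand-ins for Stan’s diag_pre_multiply and the scale/correlation decomposition; the numbers are made up.

```python
import numpy as np

# Scale vector sigma and correlation matrix Omega (invented values).
sigma = np.array([1.5, 0.7, 2.0])
Omega = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.5],
                  [0.1, 0.5, 1.0]])

# Sigma = diag(sigma) * Omega * diag(sigma)
Sigma = np.diag(sigma) @ Omega @ np.diag(sigma)

# L_Sigma = diag_pre_multiply(sigma, cholesky(Omega)) in Stan terms.
L_Sigma = np.diag(sigma) @ np.linalg.cholesky(Omega)

# L_Sigma @ L_Sigma' reconstructs Sigma, so multi_normal_cholesky with
# L_Sigma is the same distribution as multi_normal with Sigma.
print(np.allclose(L_Sigma @ L_Sigma.T, Sigma))
```

The speedup comes from avoiding a full factorization of Sigma inside the sampler at every leapfrog step.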
What about when my outcome is constrained or not continuous?

We’ll talk about this more later, but there are two important rules of thumb when model building
• Your model should place non-zero probabilistic weight on outcomes that do happen
• Your model should not place weight on impossible outcomes.
If we have outcomes that are non-continuous, or constrained, a model with multivariate normal or Student-t errors will be severely mis-specified. So what should we do?
For constrained continuous outcomes y, we can create a variable z that is on the unconstrained scale.
• For outcomes with a lower bound a, we can define an unconstrained variable z = log(y − a), or z = ((y − a)^λ − 1)/λ for some parameter λ (a Box-Cox transformation).
• If the variable has an upper bound b, we make the transformation z = log(b − y), or the analogous Box-Cox transformation.
• For a series y constrained between a lower value a and upper value b, we can create a new unconstrained variable z = logit((y − a)/(b − a)).
Once we have converted our constrained variable y into an unconstrained variable z, we can include it in the outcome vector and model as above.
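A quick numerical sketch of these transforms and their inverses (the function names are mine, not Stan’s):

```python
import numpy as np

def unconstrain_lower(y, a):        # y > a  ->  z unbounded
    return np.log(y - a)

def unconstrain_upper(y, b):        # y < b  ->  z unbounded
    return np.log(b - y)

def unconstrain_interval(y, a, b):  # a < y < b  ->  z unbounded (logit)
    p = (y - a) / (b - a)
    return np.log(p / (1 - p))

def constrain_interval(z, a, b):    # inverse: map z back into (a, b)
    return a + (b - a) / (1 + np.exp(-z))

# Round-trip check on an interval-constrained value.
y = 3.2
z = unconstrain_interval(y, a=0.0, b=10.0)
print(constrain_interval(z, 0.0, 10.0))
```

One caveat worth remembering: if you transform the outcome rather than declare constrained parameters, predictions on the original scale need the inverse transform (and a Jacobian adjustment if the transform involves estimated parameters).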
A similar issue occurs when we have a combination of continuous and binary or categorical outcomes. This is where the magic of Stan really comes to the fore. We can build transformed parameters that combine known data and unknown parameters into a single object, and then specify the joint probability of that object. An example might be illustrative.
Say we have a vector of continuous outcomes Y[t] for period t (maybe you can think of this as being a vector of macroeconomic aggregates), and a single binary outcome d[t] (maybe an NBER recession period). Unfortunately we can’t include d[t] in a joint continuous model of Y[t], as joint distributions of discrete and continuous variables don’t exist in Stan. But what we can do is create a continuous parameter tau[t], append it to our continuous outcomes Y[t], and map tau[t] to the discrete outcome d[t] via a probit link function. That is:
// your data declaration here
parameters {
  vector[rows(Y)] tau;
}
transformed parameters {
  matrix[rows(Y), cols(Y)+1] Y2;
  // fill in Y
  Y2[1:rows(Y), 1:cols(Y)] = Y;
  // fill in tau
  Y2[1:rows(Y), cols(Y)+1] = tau;
}
model {
  // priors
  // ...
  // our joint VAR(1) model of Y2 (you should probably optimize this)
  for (t in 2:rows(Y2))
    Y2[t]' ~ multi_normal(A + B * Y2[t-1]', some_covariance);

  // the model of the binary outcome
  d ~ bernoulli(Phi_approx(col(Y2, cols(Y) + 1)));
}

By specifying the model this way, working out the conditional probability of an NBER-defined recession is simply a matter of simulating forward the entire model for each MCMC draw. This will incorporate uncertainty in the parameters, as well as the uncertainty in the future values of the predictors of a recession (provided a VAR(1) structure is correct—it probably isn’t).
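The simulate-forward step might look like the sketch below. The parameter draws here are fake stand-ins; in a real workflow A, B, and the covariance factor would come from the fitted model’s MCMC output.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def Phi(x):
    # Standard normal CDF (Stan's Phi), via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

K = 3                 # e.g. 2 continuous outcomes plus tau
n_draws = 500
y_last = np.zeros(K)  # last observed state (fake)

recession_probs = []
for _ in range(n_draws):
    # Stand-ins for one posterior draw of the VAR(1) parameters.
    A = rng.normal(0, 0.1, K)
    B = 0.5 * np.eye(K) + rng.normal(0, 0.02, (K, K))
    L = 0.3 * np.eye(K)
    # One step ahead: parameter uncertainty plus future-shock uncertainty.
    y_next = A + B @ y_last + L @ rng.normal(size=K)
    recession_probs.append(Phi(y_next[-1]))  # tau is the last column

print(np.mean(recession_probs))
```

Averaging over draws gives the posterior predictive probability of the recession indicator, with both sources of uncertainty baked in.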

Aggregating data

The big advantage of working in industry is the access to incredibly granular data. For instance, the basic unit of data that we receive is a transaction, which is time-stamped. This is fake data, but it might look like this:
date                 | transaction_amount | customer_id | transaction_type
2016-08-11 05:05:29  | 102932             | 11613       | pmt
2016-08-14 20:50:29  | 102902             | 10277       | pmt
2016-08-25 12:06:52  | 115495             | 10613       | refund
2016-08-31 19:40:58  | 108238             | 104         | pmt
2016-09-06 03:19:09  | 80990              | 66521       | pmt
These are incredibly rich data, and are quite typical of what is available in retail or finance. The problem is that they contain a lot of noise.
• The timing between payments might be meaningful, but often contains no information about future payments.
• The precise payment amount on a day might be meaningful, but if customers are invoiced weekly or monthly, we might be missing something. Many customers tend to pay in chunks (several small payments) rather than the monthly due amount, especially in frontier markets where many wages are paid daily.
Because of the noisiness of very granular data, we might more easily extract the signal by aggregating our data. But how much should we aggregate? Should we sum up all payments for each minute (across customers)? For each day or month? Or should we sum within customers across time periods? Of course, your problem will dictate how you might go about this.
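As a sketch of the mechanics, here is customer-by-week aggregation over the kind of transaction log shown above, using made-up rows (in practice a pandas groupby/resample would do the same job):

```python
from collections import defaultdict
from datetime import datetime

# Made-up transactions: (timestamp, amount, customer_id).
transactions = [
    ("2016-08-11 05:05:29", 100.0, 1),
    ("2016-08-14 20:50:29", 250.0, 1),
    ("2016-08-25 12:06:52", 75.0, 2),
    ("2016-08-31 19:40:58", 40.0, 1),
]

# Sum payments within each (customer, ISO week) cell, deliberately
# discarding intra-week timing and payment-size detail.
weekly = defaultdict(float)
for stamp, amount, customer in transactions:
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    iso = when.isocalendar()
    weekly[(customer, iso[0], iso[1])] += amount

print(dict(weekly))
```

The choice of cell (minute, day, week; per customer or pooled) is exactly the signal-versus-noise trade-off discussed above.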
One particularly good example of the power of aggregation is in financial markets. In the short run (but not the very short run) equity prices behave as though they’re very close to random walk series. Under this interpretation, the only way you’re going to beat the market is by getting superior information, or perhaps trading against “dumb money” in the market. Yet in the long run, there is an amazing amount of structure in equity prices—in particular, a long-run cointegrating relationship between earnings and prices. This famous observation comes from John Cochrane’s 1994 paper.

In the case of Lendable, we found very early on that aggregating beyond the customer-level was extremely dangerous. In particular, customers tend to “atrophy” in their repayment behaviour as time passes. Customers who’ve been with the firm for longer are more likely to miss payments and default than customers who’ve just signed on. At the same time, the firms we lend to are growing incredibly fast. This means that at any time, there are many new, enthusiastic customers in a portfolio, lowering aggregate loss rates. But these aggregate loss rates don’t actually give much of an insight into the economic viability of the lender, hence the need to model individual repayments.

Your data are not IID. Stan can help

When Lendable started our modeling of individual repayments for debt portfolios, we were using more traditional machine learning tools: trees, nets, regularized GLMs etc. We made the decision to switch over to Stan after realising that individual repayments were quite highly correlated with one another. Take two super-simple data generating processes.
The first assumes that shocks to repayment are uncorrelated across people:

y[n, t] = mu + epsilon[n, t],  with  epsilon[n, t] ~ normal(0, sigma)

The second assumes that every person receives a “time shock” and an idiosyncratic shock:

y[n, t] = mu + eta[t] + epsilon[n, t]

with

eta[t] ~ normal(0, sigma_eta)

and

epsilon[n, t] ~ normal(0, sigma_epsilon)

If there are N customers, what is the variance of their cumulative payments under these two specifications? Within a period, repayments under the first process are distributed

sum over n of y[n, t] ~ normal(N * mu, sqrt(N) * sigma)

which gives the aggregate payments between t = 1 and t = T as

sum over n, t of y[n, t] ~ normal(N * T * mu, sqrt(N * T) * sigma)

Now how about under the second DGP, in which repayments are correlated? Now we have

sum over n, t of y[n, t] ~ normal(N * T * mu, sqrt(N * T * sigma_epsilon^2 + N^2 * T * sigma_eta^2))

That is, under the first DGP, our uncertainty around future repayments is smaller than under the second DGP, by the ratio of the scales, which (taking sigma = sigma_epsilon) is

sqrt(1 + N * sigma_eta^2 / sigma_epsilon^2)

A back of the envelope rule of thumb we use is that the scale of portfolio shocks is roughly half the scale of individual shocks (sigma_eta ≈ sigma_epsilon / 2), meaning that methods which do not account for the correlation massively under-count observed volatility.
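A Monte Carlo sketch of the two DGPs makes the gap vivid. The parameter values are invented, with the portfolio shock set to half the scale of the individual shock per the rule of thumb above.

```python
import numpy as np

rng = np.random.default_rng(3)

sims, N, T = 1000, 200, 12
mu, sigma_eps, sigma_eta = 10.0, 2.0, 1.0  # eta scale = half of epsilon scale

# DGP 1: iid idiosyncratic shocks only.
totals_iid = (mu + sigma_eps * rng.normal(size=(sims, N, T))).sum(axis=(1, 2))

# DGP 2: a common time shock eta[t] shared by every customer, plus iid shocks.
eta = sigma_eta * rng.normal(size=(sims, 1, T))
totals_common = (mu + eta
                 + sigma_eps * rng.normal(size=(sims, N, T))).sum(axis=(1, 2))

# Theory: ratio of scales is sqrt(1 + N * sigma_eta^2 / sigma_eps^2).
ratio = totals_common.std() / totals_iid.std()
print(ratio)
```

With N = 200 customers the theoretical ratio is sqrt(1 + 200/4) ≈ 7, so an iid model's predictive intervals for the portfolio would be several times too narrow.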

Now the models we use at Lendable aren’t quite as simple as this example, but we adhere to the general gist, modeling portfolio shocks independently from idiosyncratic shocks. The problem we found before migrating to Stan was that essentially all black-box machine learning techniques assume conditional iid data (a la the first model), and so our predictive densities were far too confident. If you’re making pricing decisions based on predictive densities, this is highly undesirable—you are not pricing in known risks, let alone unknown risks.
After realising that essentially no typical machine learning methods were able to take into account the sorts of known risks we wanted to model, we started looking for solutions. One early contender was the lme4 package in R, which estimates basic mixture models (like model 2 above) using an empirical-Bayes-flavoured approach. The issue with this was that as the dimension of the random effects gets large, it tends to underestimate their scale, which kind of defeats the purpose.
The other issue was that standard mixture models assume that the data are conditionally distributed according to some nice, parametric distribution. When you’re modeling loan repayments, this is not what your data look like.

Thinking generatively affects how you think about the model specification

The mindset for many in the “data science” scene, especially in finance, is “how can I use machine learning tools to discover structure in my data?” I’d caution against this approach, advocating instead a mindset of “what structure might give rise to the data that I observe?” Lendable’s experience in modeling small loans in frontier markets suggests that this latter approach can yield very good predictions, interpretable results, and a mature approach to causal inference.
When building a generative model, it’s wise to first plot the data you want to model. For instance, weekly repayments on loans might have a density that looks like the black line in the following chart:

You can see that it is a multi-modal series. There’s a clear spike at 0—these loans have no fixed repayment schedule, so it’s quite common for people to pre-pay a big chunk in one month, then not pay in future months, or to not pay during periods in which they don’t have much money (say, between harvests). There are also two different modes, and a very long right hand tail.
When you look at a density like this, you can be pretty sure that it’ll be impossible to find covariates such that the residuals are normal. Instead, you have to do some deeper generativist reasoning. What this density suggests is that there are people paying zero (for whatever reason), some others paying at the modes—these folks are actually paying the “due amount”—and some other people paying much more than the due amount.
One extremely useful technique is to decompose the problem into two nested problems: the first is a model of which group each observation belongs to, the second is the predicted payment conditional on the group it’s in. This isn’t my idea; Ben Goodrich taught it to me, and it has been written about in Rachael Meager’s excellent job-market paper.
So how do you go about implementing such a model in Stan? The basic idea is that you have two dependent variables: one tells us the “class” that the observation belongs to (expressed as an integer), and the other the actual outcome.
data {
  // ...
  int class[N];
  vector[N] outcome;
}
//...
model {
  // priors here

  class ~ categorical(softmax(mean_vector));

  for(n in 1:N) {
    if(class[n]==1) {
      outcome[n] ~ my_model_for_class_1();
    } else if(class[n]==2) {
      outcome[n] ~ my_model_for_class_2();
    }
    // ...
  }
}
Of course we could implement something like this outside of Stan, by modeling classes and conditional outcomes separately, but if we’re using random effects in each model and want to estimate their correlation (yes, we want to do both of these things!) we have to use Stan.
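Simulating from this two-part structure shows how it reproduces the spike-at-zero, multi-modal density described above. All parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 50_000
# Class probabilities: zero payer, due-amount payer, over-payer (made up).
class_probs = np.array([0.3, 0.5, 0.2])
classes = rng.choice(3, size=n, p=class_probs)

payments = np.zeros(n)
due = classes == 1
over = classes == 2
# Due-amount payers cluster near the (fake) due amount of 100.
payments[due] = rng.normal(100.0, 5.0, due.sum())
# Over-payers produce the long right tail.
payments[over] = rng.lognormal(5.0, 0.5, over.sum())

print((payments == 0.0).mean())  # the spike at zero
```

No single parametric family would put mass on all three features at once; the nested model gets them almost for free, and each sub-model can have its own covariates.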

Treatments affect a distribution, not just its expected value

A big part of the modeling in industry is evaluating the impact of interventions using randomized control trials (in industry, known as A/B tests). Sadly, it is common to use very basic estimates of causal impacts from these tests (like difference in means, perhaps “controlling” for covariates). These commonly used techniques throw away a lot of information. A treatment might affect the expected value of the outcome, but is that all we’re interested in?
Using the example of small loan repayments, what if a lender’s call-center management impacts repayments, but mainly by changing the number of people paying zero in a given month? Or perhaps the treatment affects customers’ probability of making a large payment? Or maybe it convinces those who were going to make a large payment already to make a larger payment? Building a generative model like the one discussed above and including an experimental treatment in each sub-model allows us not only to get more precise estimates of the treatment effect, but also to understand how the treatment affects the whole distribution of the outcome.
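A small simulation of that first scenario: a treatment that only moves people out of the “pays zero” class. The difference in means is noisy because of the heavy right tail, but the zero-share comparison isolates exactly where the treatment acts. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

def simulate(p_zero):
    # Two-part DGP: a zero spike plus a heavy-tailed positive payment.
    zero = rng.random(n) < p_zero
    return np.where(zero, 0.0, rng.lognormal(4.0, 1.0, n))

control = simulate(0.30)   # 30% pay zero without the intervention
treated = simulate(0.25)   # the treatment shrinks the zero class

zero_share_gap = (control == 0).mean() - (treated == 0).mean()
print(zero_share_gap)
```

A difference-in-means analysis would mix this shift together with tail noise; modeling the zero class explicitly gives a sharper and more interpretable treatment effect.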

Conclusion

At Lendable we care very much about making investment decisions under uncertainty. Models will tend not to include risks that haven’t appeared in the historical data. But we see models all the time that don’t even take into account the information we do have in the data, let alone any of these Black Swans. Stan makes building models that use this information quite easy. If you’re keen to start learning, then get in touch, and I’ll give you a project! I have many.