Thursday, January 19, 2017

How Prediction Works

Jeff Alstott

This overview of prediction was Part 1 of a larger report on forecasting technology, which is available here. A PDF of this overview is here and the rest of the report is in review.

Let us define what we mean here by “prediction.” To predict something is to say that a specific thing will happen. Examples of prediction in technology include:

  1. A computer will beat the best human player in chess by the year 2000
    Ray Kurzweil, 1990 [1]
    (The computer Deep Blue beat world champion Garry Kasparov in 1997)
  2. Mail will be delivered between New York and Australia by guided missiles by 1969
    Arthur Summerfield, US Postmaster General, 1959 [2]
    (This didn’t happen)
  3. 25% of all cars will be electric by 2040
    Bloomberg New Energy Finance report, 2016 [3]
    (Outcome pending)

These predictions are specific and concrete enough to be objectively verifiable. Statements about the future can be used for many things, such as inspiring people or helping them make decisions [4], but here we are interested in predictions for their ability to be objectively tested by comparing them to the real world. If the prediction matches the real world, the prediction was correct, and if it doesn’t, the prediction was incorrect. Correct predictions may then be used for other purposes like making decisions or inspiration, but their usefulness will stem from their accuracy, not other factors.

Most predictions, particularly successful predictions, are based on a model of the real world. For example, if someone asks you if juice made of apples would taste good, you probably first imagine the juice, and then answer the question. That imagined food is a model of the world, in your head. The model has the components of the juice, how they taste in your mouth, and what you yourself think of that taste. We make models like this all the time; whenever we imagine, plan, forecast, or predict, we are typically using models. These models might be implicit, but they’re there.

What’s Inside a Model

Causes: Not All of Them

In order to be accurate, a model does not have to be a complete model of everything. Your mental model of food likely doesn’t include atoms, and it definitely doesn’t include Jupiter. Some parts of the real world that aren’t included in your model, like Jupiter, have no bearing on what causes food to taste good on Earth, so it’s clear they are not necessary for the model to make good predictions. However, some of the things your model skips over are actually critical to what makes a food taste good, like atoms. Your experience of how a food tastes is determined by how a fat or sugar molecule’s atoms lie inside your taste buds’ receptor molecules. Your experience is also determined by a myriad of other factors, like how the taste buds are coupled to neurons that send a signal into your brain. But our models can still make good predictions without perfectly representing all aspects of the relevant parts of the universe. This is because the world often has persistent relationships. For example, apple juice has sugar molecules in it, and those molecules’ atoms can interact with your taste buds’ receptors in a particular way, and those receptors interact with the neurons in your tongue in a particular way, and that eventually gives your mind a particular sensation of taste. Because this chain of causes in the world is persistent, your model of the world can simply say “apple juice tastes sweet” and make accurate predictions. There can be fathomless depths of causality hiding inside that statement, but because the hidden causal interactions are persistent, they can be compressed. This compressibility of the world is what allows a simple model to make correct predictions.

Sometimes, however, those hidden chains of causality are not persistent, and so they start to matter. For example, “vinegar tastes sour” is a good predictive model, for the same reason that “apple juice tastes sweet” is a good predictive model: the causes underneath that statement are persistent. But if you eat the fruit of Synsepalum dulcificum, often called “miracle fruit,” then our model for vinegar breaks. Miracle fruit contains the molecule miraculin, which binds to the taste receptors that typically bind to sugar. Miraculin stays bound to those receptors, and if you then eat vinegar, a miracle occurs: the normally sour vinegar tastes sweet. The miraculin molecule interacts with the molecules of the vinegar (or any sour food) in a way which causes the sugar-sensitive taste receptor to activate, beginning the chain of causes that leads your mind to experience “tastes sweet.” Your model of “vinegar tastes sour” now predicts the wrong thing, because it did not include the relevant causes.

Essentially all models of the physical world have hidden causality. The process of finding these hidden causes is frequently called “science.” Science involves proposing new models of the world (hypotheses) that yield different predictions from our old models, then testing to see if those new predictions are correct (experiments). For example, the old model of “vinegar tastes sour, regardless of having eaten a miracle fruit” would get supplanted by “vinegar tastes sour typically, but tastes sweet after having eaten a miracle fruit,” because the second model gave better predictions. Science could then go deeper to understand what the causes were that led vinegar to taste sweet with the miracle fruit, finding previously hidden mechanisms like taste receptors, molecules, etc. The same inquiry could be used to understand why vinegar didn’t taste sweet without the miracle fruit! Through the scientific process we can get ever more accurate understandings of the causes in the world, which allow us to create models that give us ever more accurate predictions.

Predictors: Maybe Causes, Maybe Not

Consider two bottles, one labeled “apple juice” and one labeled “vinegar.” You have not eaten a miracle fruit, and none are around. Which bottle’s contents will taste sweet, and which will taste sour? Why? In this case, drinking the bottle labeled “apple juice” will lead you to taste sourness, and drinking the bottle labeled “vinegar” will lead you to taste sweetness. The reason, of course, is that the bottle labeled “apple juice” actually contained vinegar, and the bottle labeled “vinegar” actually contained apple juice. You know why changing the labels did not cause a change in taste: bottle labels do not cause taste (unless you eat the label). But even knowing this, you may have used the bottle label to predict the taste. You may have a model that says bottles are labeled “apple juice” and their contents taste sweet due to the same common cause: actually containing apple juice. Seeing a bottle labeled “apple juice” led you to infer that it contained apple juice, and thus that the contents were sweet. You would not have thought that the bottle label caused the taste, but you would have thought it predicted the taste. Thus, the bottle label was a predictor with a predictive relationship with the thing to be predicted. In contrast, “drinking apple juice causes a sensation of sweet taste” considers the apple juice also as a predictor, but it has an explanatory or causal relationship with the thing predicted [5].

Predictive relationships can be valuable because they can, of course, lead to correct predictions. These predictions are not guaranteed, however, since other forces can get in the way and break the apparent relationship between a predictor and the predicted thing, as happened with the bottle labels. This most notably happens if we confuse a predictor for a cause and assume that manipulating that predictor ourselves will change the outcome. Making that mistake might lead us to write “apple juice” on a bottle and pour water into it, then wonder why the liquid inside doesn’t taste sweet. Predictors are very useful, but we should not assume they are necessarily causes.

Associational Models: Making Acausal Predictions and Inspiring Causes

Not all models try to understand causes; sometimes we construct models solely for the purpose of prediction. In these prediction-focused models there may be no explicit causal structure, but simply associations between one or more variables and the thing to predict. These are acausal, associational models. For example, you may have noticed that all bottles labeled “apple juice” that you ever encountered had contents that tasted sweet, and use that fact alone to predict that any bottle in the future labeled “apple juice” will have contents that taste sweet. The current exemplar for acausal, associational models is called “machine learning” [6]; machine learning techniques essentially transform many kinds of data in increasingly sophisticated ways to find associations between the thing to be predicted (like your risk of a car accident) and many possible predictor variables (age, income, education, car, etc.). In technology development, a long-used predictive model is trend extrapolation. In trend extrapolation the thing to be predicted is the change in performance of a technology (like computer performance) and the predictor variable is the amount of time. Predictive models like these are then evaluated by predicting on new data (such as data from the future), to see whether the associations in the original data that the model was trained on continue to hold for new data. If the associations continue to hold, predictive models can make good predictions for new parts of the world, which is their purpose.

These prediction-focused, associational models can potentially do great prediction, given the right data that shows the right associations between parts of reality. However, they are fragile to changes in the world that weren’t reflected in the data they were trained on, particularly if they rely on predictors that are not the actual causes of the thing to be predicted. For example, a predictive model based on trend extrapolation will predict that a trend in technology will continue as long as the years continue to go by, regardless of data from economics, engineering, or physics. But the actual causes of technology development are not calendar years. The longtime success story of technology trend extrapolation, Moore’s law, has now slowed to what may soon be a halt [7]. But technology can also develop faster than trends predict: genome sequencing prices dropped smoothly between 2001 and 2008 from $100 million to $10 million, but then plummeted to $10 thousand by 2011, 2.5 orders of magnitude below the expectation from the previous trend [8]. Association-based predictive models can give more robust prediction by expanding the amount of data they are trained on. Large amounts of data have been the basis of the current machine learning explosion. However, unless they happen to capture the real causal mechanism as one of their associations, these models still work best when predicting situations that are similar to situations in the data they were trained with (e.g. the same period of history for trend extrapolation). Outside of that regime they are blind.

The solution to models that cannot accurately predict the parts of the world you want to predict is to make better models. To accurately predict a part of the world that is different from previously observed parts of the world, the best hope is to use models that reflect the causal mechanisms of reality. In order to identify these causal mechanisms, we need to do science. We need to propose mechanistic models for how the world works (hypotheses) and use them to make and test falsifiable predictions (experiments). But where do we get these hypotheses from? Frequently, association-based modeling. “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’” [9]. Associations between two variables (like apple juice and sweetness) are data that a causal model would need to explain, and they are frequently the inspiration behind studies that eventually find more detailed mechanisms (like sugar and taste receptors).

What’s Inside an Experiment

Calibration of Uncertainty

Your model of the world may be predictively accurate, in that if a food tastes good in the model it also tastes good in the real world. The model can also be useful without being perfectly accurate. A model that accurately predicted how foods would taste 90% of the time might still be useful, especially compared to a model that was correct 50% of the time. But even better is a model that you know gives correct predictions only 90% of the time. If you think a model is perfectly accurate, you will think you will be able to make decisions that will allow you to enjoy 100% of your meals. But if you know the model is only correct for 90% of foods, you can anticipate the occasional surprise and prepare accordingly (like having a bottle of mouthwash handy). Conversely, if you think a model is correct 50% of the time and it’s actually 90%, you’re missing out on predictive power; you may decide to just select food by flipping a coin, even though you could use your imperfect model and enjoy 40% more of your meals!

Knowing the accuracy of a model’s predictions is called calibration [10]. Calibration is a statement about the certainty or confidence associated with any prediction. Instead of your model simply predicting “Vinegar will taste bad” and “Apple juice will taste good,” it could instead predict “Vinegar will taste: Bad-90%, Good-10%” and “Apple juice will taste: Good-70%, Bad-30%.” If the model’s predictions with 90% confidence are correct 9 out of 10 times, and its predictions with 70% confidence are correct 7 out of 10 times, this model is well-calibrated. Most models of interesting phenomena are imperfect, but the well-calibrated models are more useful because they allow you to correctly predict how often you’ll be right or wrong, on average.
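Put a little more formally (a rough statement of the same idea, not a formula from the report): a model is well-calibrated when, for every confidence level $c$ it reports,

$$\Pr\left(\text{the predicted outcome occurs} \mid \text{the model assigned it confidence } c\right) \approx c.$$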

What properties does a model need to have in order for it to be anything other than completely certain? A model could include a mechanism that is variable. For example, if someone rolls a fair die and you need to predict what value will come up, your mental model includes the fact that die rolls are variable, and that they come up 1, 2, 3, 4, 5, and 6 equally often. So your model will predict “1-1/6, 2-1/6, 3-1/6, 4-1/6, 5-1/6, 6-1/6.” This is how a well-calibrated model will predict outcomes that have causes that are variable.

Variability, however, is in the eye of the beholder. Imagine someone rolls a die, looks at it, covers it, and then asks you what the value is. Your mental model has the same information as before, and so it again should predict “1-1/6, 2-1/6, 3-1/6, 4-1/6, 5-1/6, 6-1/6.” The other person, however, saw the die’s value and so has more information. Their mental model will give a different response: “5-100%.” These different responses are due to different uncertainty about the world, which is caused by a combination of what data we have observed and what our estimates of the world were prior to receiving that data.

To return to our apple juice, maybe our initial model is “Apple juice tastes: Good-50%, Bad-50%.” We drink three glasses of apple juice and they all taste good, so perhaps we update our model to “Apple juice tastes: Good-88%, Bad-12%,” depending on how strongly you believed in your initial model. Later we drink another glass of apple juice and it tastes bad, so we update our model to “Apple juice tastes: Good-70%, Bad-30%.” As we drink more and more samples of apple juice, our model’s distribution of good and bad will shift to match the real world variability of apple juice [11].
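For example, these particular numbers are what a textbook Beta-Binomial update produces if we happen to start from a Beta(1/2, 1/2) prior on the probability that apple juice tastes good (one choice of initial belief among many):

$$\Pr(\text{good}) = \frac{1/2 + 3}{1 + 3} \approx 0.88 \;\;\text{after three good glasses}, \qquad \Pr(\text{good}) = \frac{1/2 + 3}{1 + 4} = 0.70 \;\;\text{after the fourth, bad glass.}$$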

The variability of the real world and the calibration of your model are important for identifying if your model is making true predictions about the world. For example, consider predicting a technology’s future performance using trend extrapolation. Historically, these models predict a value within ±30% of the real value, about 50% of the time [12]. Are these good predictions? If these models were 50% confident that the true value would lie within that range, then these models gave true predictions about the world. If these models were 99% certain the true value would lie within that range, then they did not make true predictions about the world. And if the models gave no statement about their certainty, but just predicted a single value at one specific point, then they were likely all completely wrong. Unfortunately, most trend extrapolation has no statement of certainty, and may implicitly fall in the last group. More informative trend extrapolation uses historical data to estimate the variability of the trend, and then uses that variability to generate a distribution of confidence values for all possible levels of future performance. These methods can and have been used for predicting technology development, with promising initial results [13]. Having well-calibrated predictions of future technology allows us to weigh different outcomes by how probable they are, and then prepare for the entire distribution of outcomes accordingly.
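One simple way to do this (a sketch of the general idea behind methods like [13], not their exact specification) is to treat log-performance as a random walk with drift, estimate the drift and the scatter around it from the historical data, and read the forecast off as a distribution rather than a single line:

$$\log y_{t+1} = \log y_{t} + \mu + \varepsilon_{t}, \quad \varepsilon_{t} \sim \text{N}(0, \sigma^{2}) \;\;\Longrightarrow\;\; \log y_{t+h} \sim \text{N}\!\left(\log y_{t} + \hat{\mu}\,h,\; \hat{\sigma}^{2}h\right)$$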

Adding Predictors to Reduce Uncertainty and Test Causes

A well-calibrated model that accurately reflects the distribution of good and bad tasting apple juice is useful, because we can use that to make predictions of how often apple juice tastes good or bad. But it would be even more useful if it reflected which of the juices taste good or bad; i.e. if it made better predictions. Such a model would need more data than just the good and bad taste of all apple juice. It would need additional data: predictors of the apple juice’s taste. As we’ve discussed, predictors are ideally causes of the apple juice’s taste, but they could also be factors that don’t cause the taste, but have historically been associated with it. Whatever their origin, predictors allow a model to segment apple juice further, such as into green and red apple juice. After collecting data, we may get a model that says “Green apple juice tastes: Good-10%, Bad-90%; Red apple juice tastes: Good-90%, Bad-10%.” This model may make much better predictions due to the lowered uncertainty, which is thanks to the additional predictors.

A predictor may enable you to make better predictions, but what if it’s not a cause? Eventually you may run into a situation like the bottle labels, where your predictive model will fail. How can we determine if using green apples to make apple juice actually causes the bad taste, or if it’s an acausal predictor that has only benefited prediction due to its historical association with bad taste? There are two answers: 1. add more predictors, or 2. run an experiment.

Add More Predictors

We could add another predictor: the ripeness of the apples used to make the juice. Ripe apples can be green or red, but unripe apples can only be green. We collect data on the taste of apple juice, considering the apples’ color and their ripeness, and find: “Unripe green apple juice tastes: Good-0%, Bad-100%; Ripe green apple juice tastes: Good-90%, Bad-10%; Ripe red apple juice tastes: Good-90%, Bad-10%.” Ripeness is what’s important! The color of the apple appears to have no relevance for predicting taste.

Run An Experiment

One might say that adding more predictors is all well and good, but to find the causes we really need to run an experiment. However, that is what we just did: we tested the effect of apple color on juice taste by controlling for the effect of apple ripeness. Testing the influence of one factor while controlling for other factors is an experiment. The process of finding causes with experiments is the process of testing predictors by adding other predictors.

Earlier we introduced “predictors” versus “causes” because the distinction fits our intuitions and reflects the ideal that we are pursuing. Now that we have built up understanding about how we use data to build models of the world, we can handle a more complete picture: in practice, “causes” are not functionally different from other predictors. True causes can be found, but we can never be sure that’s what they are. In our experiments with apple juice we have only controlled for apple color and ripeness; but perhaps the unripe apples actually contain very little juice, so they can’t make the plain water taste good. If we made an apple juice with a great many unripe apples, perhaps that juice would taste good, which our predictive model would have failed to predict correctly. Unripe apples’ possible lack of juice is an example of a confounding factor we can “control for,” alternatively known as “collect data on.” But after collecting this data (or running this experiment) we see that controlling for quantity of juice doesn’t help our predictions. The ripeness of the apples remains a cause of flavor, or so it seems. What we call “causes” are predictors whose predictive power has survived new data, so far. Practically, causes are not so much “true” as they are “tough” [14].

Science finds new data, like the miracle fruit’s effect on taste, that we could not previously predict. There is already no shortage of such data; nearly every model is created with an imperfect ability to predict [15]. Science is also the process of finding factors that, when we account for them, yield better prediction. In practice what sets apart “doing science” from making acausal predictive models is deciding what data to collect. There is a difference between having data from many different predictors and having data that can be used to distinguish which of those predictors can be explained by the others. Perhaps in our experiment with the unripe apples all the unripe apples had been stored in a dirty kitchen sink, marring their flavor. If we are creating an acausal, associational model we would not need to care; the model may still make good predictions, up until we start predicting the taste of apple juice from a different kitchen. But if our goal is to identify possible causes, then we must consider such potential confounders and other ways the data could be misrepresenting reality [16]. The work of science is then to carefully construct a data set that makes clear what is the influence of each predictor, including the kitchen sink [17].

When we think of new factors to control, we are essentially proposing different causal models (often called hypotheses). If we think that data on the kitchen sink may enable better predictions, we are proposing a causal model in which the kitchen sink is a cause. If we think that data on the person making the juice may enable better prediction, such as whether they are right- or left-handed, we are considering a causal model in which handedness is a cause [18]. There are infinitely many causal models that explain any data set perfectly. How do we know which of these causal models are credible? By testing their predictions against new data that is different from what we have seen before. New data that shows new associations shrinks the space of causal models that are plausible. When we design and run an experiment we are constructing a set of data that will reduce our uncertainty in what causal models are probable. We have historically found that simpler models do better at accurately predicting new data, and so we often have a prior assumption that the simpler causal models are more probable (“Occam’s Razor”). However, if the data of the real world is shown to have many complexities, like with miracle fruit, then the most probable surviving models may also have many complexities, like taste receptors and miraculin molecules and neurons.

We never know for certain what the true causes are. As with the miracle fruit, there is always the opportunity for some part of the world to hinge on a hidden cause that wasn’t in the models that previously seemed most probable. As such, the goal of predictive modeling is not to be right, but to be less wrong. “All models are wrong, but some are useful” [19]. With the continued thoughtful inspection of purposefully collected data, we can create models of reality that give ever better prediction and are ever more useful.

A Note on Qualitative and Quantitative Models

When you make a mental model, like a model of what apple juice will taste like, it often feels different from a model that you have written down. This is particularly true if the model written outside your head has been formalized to include explicit rules, like “Apple juice always tastes sweet” or perhaps “IF consume apple juice THEN taste sweetness.” Your mental model certainly feels different if the written rule is “Consume apple juice ⇒ p(taste sweetness) = 0.873.”

The difference between mental models and formalized models has often been cast as “qualitative” vs. “quantitative” models. However, qualitative and quantitative models can act a lot alike. The gut predictions of a well-trained expert have a lot of the same properties as a well-trained artificial neural network. Both human minds and artificial networks can contain very accurate predictive models of the world, and they are created by observing potentially vast amounts of data. More importantly, in both cases the internal dynamics of the model can be unknown. A human expert can make an accurate prediction just because it feels right, and artificial neural networks can be horribly opaque in the meaning of what is being calculated inside. These two kinds of models are implicit: the structure of the model is unknown or unclear. Implicit models are in contrast to explicit models, where the structure of the model is known. The main thing that functionally sets qualitative models apart from many quantitative models is that their internal structure is unclear; a human may be able to somewhat explain their gut judgment, but often not completely enough to enable reproducing the model elsewhere.

Explicit models are more easily scrutinized, and thus more easily tested and improved. The implicit models of human judgment can be worse than explicit models if the latter have been repeatedly refined. However, when people are able to update their judgment in response to new data, then the implicit mental models of humans can still be part of the world’s best predictors [20]. But regardless of whether models are explicit or implicit, quantitative or qualitative, for the purpose of making predictions they can be functionally the same. They all make predictions about the world, they all can be accurate or inaccurate, they all can be well- or poorly-calibrated, and they all can be updated in response to new data.

Aggregating Models with Prediction Markets

Prediction markets are a tool that is being increasingly used to make accurate and well-calibrated predictions about the world, such as the SciCast market that predicted science and technology events [21]. In prediction markets, participants bet on the outcome of some event, which pits their predictive models against each other. For example, you may bet me $9 against my $1 that a glass of apple juice will taste sweet. The prevailing betting odds on an event’s outcome are the “market’s prediction” of the probability of the outcome (in this case, 90%). Prediction markets have done very well for two main reasons: calibration and aggregation.
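The implied probability is just the bettor’s stake divided by the total pot:

$$p = \frac{\$9}{\$9 + \$1} = 0.9$$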

Calibration

All participants are rewarded for using well-calibrated models: they will state their confidence at the level they actually believe is warranted. If they fall out of calibration, they will either go bankrupt or leave winnings on the table.

Aggregation

In a prediction market virtually any set of models can compete against each other. Prediction markets are typically thought of as integrating the opinions of humans, which are frequently implicit models. However, those humans can also rely on the predictions of explicit models, which may be computational models structured completely differently from the humans’ mental models. A significant contribution to prediction markets is the human judgment about which explicit models are useful, and how much to believe them. Furthermore, in an open prediction market, models from many different domains can potentially be brought in. Even if the organizers of the market are only considering human predictors from one area, like academia, people from other areas, like industry, can potentially enter and reap rewards.

Prediction markets are an aggregation technique that gives more weight to well-calibrated and accurate models, which has proven to be very general and powerful. However, they are not a predictive model themselves. Prediction markets still leave room for improving the object-level models being aggregated, which improves the performance of the whole market. For example, a well-calibrated market can predict an event will occur with 90% probability, and such events will actually occur 90% of the time. However, if the individual members of the market were more knowledgeable, the market could be more precise: several events that were all previously trading at 90% could split into two groups trading at 99% and 1%. The market could still be well-calibrated: the events traded at 99% could still occur 99% of the time, and the events traded at 1% could still occur 1% of the time, even though previously they were all trading at 90%, with good calibration. This is equivalent to narrowing the confidence intervals of a prediction; the predictions become more precise. Prediction markets are great at achieving calibration, but they are only as accurate as the population of models inside them allows them to be. Thus, even if one is using prediction markets to aggregate predictions, there is still utility in developing and disseminating better object-level models.


  1. Ray Kurzweil, The Age of Intelligent Machines (The MIT Press, 1990).
  2. David William English, The Air Up There (McGraw Hill Professional, 2003).
  3. “New Energy Outlook 2016” (Bloomberg New Energy Finance, June 2016).
  4. Persistent Forecasting of Disruptive Technologies (Washington, D.C.: National Academies Press, 2009), http://www.nap.edu/catalog/12557.
  5. Galit Shmueli, “To Explain or to Predict?” Statist. Sci. 25, no. 3 (August 2010): 289–310, doi:10.1214/10-STS330.
  6. Machine learning includes such disparate methods as artificial neural networks, random forests, and support vector machines. They all transform and associate data without assuming any particular causal structure.
  7. M. Mitchell Waldrop, “The Chips Are Down for Moore’s Law,” Nature 530, no. 7589 (February 9, 2016): 144–47, doi:10.1038/530144a.
  8. “The Cost of Sequencing a Human Genome,” July 6, 2016, https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/.
  9. Randall Munroe, “Xkcd: Correlation,” March 6, 2009, https://xkcd.com/552/.
  10. A. Philip Dawid, “The Well-Calibrated Bayesian,” Journal of the American Statistical Association 77, no. 379 (1982): 605–10, http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1982.10477856.
  11. This whole process is what happens in Bayesian statistics.
  12. Shannon R. Fye et al., “An Examination of Factors Affecting Accuracy in Technology Forecasts,” Technological Forecasting and Social Change 80, no. 6 (July 2013): 1222–31, doi:10.1016/j.techfore.2012.10.026; Carie Mullins, “Retrospective Analysis of Technology Forecasting: In-Scope Extension,” August 13, 2012.
  13. J. Doyne Farmer and François Lafond, “How Predictable Is Technological Progress?” Research Policy 45, no. 3 (April 2016): 647–65, doi:10.1016/j.respol.2015.11.001.
  14. Stephen Thornton, “Karl Popper,” in The Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta, Winter 2015, 2015, http://plato.stanford.edu/archives/win2015/entries/popper/.
  15. Alan Musgrave and Charles Pigden, “Imre Lakatos,” in The Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta, Summer 2016, 2016, http://plato.stanford.edu/archives/sum2016/entries/lakatos/.
  16. There are many.
  17. The data sets most associated with science are randomized control trials, which aim to control for all possible factors by randomly assigning objects to be tested in some way. This works great if you can readily place certain properties on objects, like taking an apple and making it ripe or unripe. However, this isn’t always so easy, so other kinds of data construction are also useful tools.
  18. Or at least shares a common cause with the apple juice taste.
  19. George E. P. Box and Norman R. Draper, Empirical Model-Building and Response Surfaces, 1 edition (New York: Wiley, 1987).
  20. Philip E. Tetlock and Dan Gardner, Superforecasting: The Art and Science of Prediction (Crown, 2015).
  21. K. B. Laskey, R. Hanson, and C. Twardy, “Combinatorial Prediction Markets for Fusing Information from Distributed Experts and Models,” in 18th International Conference on Information Fusion (July 2015), 1892–98.
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Tuesday, January 17, 2017

Building useful models for industry—some tips

This post is primarily aimed at those interested in building models for use in industry—in particular, finance/banking and consulting. There are many interesting domains for modeling outside these fields, but I know nothing about those. More than anything, this is a collection of modeling tips, tricks and lessons that I’ve gathered from the problems that we have come across at Lendable.
Some context. Lendable, the firm I work for, lends money to finance firms in sub-Saharan Africa. One of the things we’re interested in when we make loans is whether and how much our customers’ customers will repay on their (small) loans. Mobile payments are the primary method for paying back these loans, so there is typically a very good paper-trail of historical payments. We use this paper-trail to build models of borrower behaviour. All these models are implemented in Stan. It would be far harder to do so in other languages.
I’d like to cover a few aspects to this work. They are:
  1. Model scoping. Models in industry are often used to generate multiple outputs for many parties. We need to model these jointly;
  2. Aggregating data can throw away information, but can also throw away noise;
  3. Your datapoints are probably not iid. In finance, using an iid assumption where the DGP is not iid is a recipe for disaster;
  4. Thinking generatively affects how you think about the model specification;
  5. Treatments affect a distribution, not just its expected value.

Model scoping

A very typical setting in industry/government is that a business problem has many possible questions, and so the modeler builds a suitably general model that should be able to provide answers to these (many, possibly quite different) questions. What’s more, we want distinct questions to be answered by a model in a consistent way.
As an example, take the Lendable Risk Engine, a model I developed over the last couple of years. This is a model of repayments on micro-loans in Sub-Saharan Africa. Many of these loans have no repayment schedule (so-called Pay-as-you-go), with borrowers paying to “unlock” the asset (the underlying asset might be a solar panel, or water pump). When Lendable is looking to purchase a portfolio of these loans or make a collections recommendation, we might have many questions:
  • Which customers do we expect will repay the remaining amount of their loans?
  • What proportion of the remaining balance of a portfolio of loans do we expect will be paid?
  • What is the expected timing of these cash flows?
  • How correlated do I expect individual loan repayments to be a) today, b) in a stressed scenario?
  • How many defaults do I expect to occur? How much do these defaulting customers owe?
  • What impact on a) an individual, b) the portfolio do I expect will happen from some intervention/change in collections style?
We want the answers to these questions to agree. If I ask “what is the 10th percentile of predicted portfolio cashflows?” I want to know the number of defaults associated with that scenario. Consequently, we need to model the answers to our questions jointly.

So, how do we model our outcomes jointly?

A straightforward way of implementing a joint model is to model a vector of outcomes rather than a single outcome. For continuous outcomes, Stan allows us to use multivariate normal or multivariate Student-t errors. If our vector of outcomes for person $i$ at time $t$ is $Y_{it}$ and our vector of predictors is $X_{it}$, the model might be something like:

$$Y_{it} \sim \text{Multi normal}\left(\alpha_{i} + B_{i} X_{it},\; \Sigma\right)$$

where $\alpha_{i}$ is a vector of intercepts for customer $i$, $B_{i}$ is their matrix of coefficients, and $\Sigma$ is a covariance matrix, which we normally decompose into a scale vector $\sigma$ and correlation matrix $\Omega$ such that $\Sigma = \text{diag}(\sigma)\,\Omega\,\text{diag}(\sigma)$. Implementing this model directly in Stan is possible and straightforward, but the sampling performance can be a bit slow. You’ll get a large speedup by taking a Cholesky factor of the covariance matrix, which, given the decomposition above, is

$$L_{\Sigma} = \text{diag}(\sigma)\, L_{\Omega}, \quad \text{where } \Omega = L_{\Omega} L_{\Omega}'$$

We do this in the transformed parameters block. In the model block, we then use the Cholesky-factored multivariate normal, multi_normal_cholesky.
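A minimal sketch of those two blocks (the names here are illustrative: sigma is the scale vector, L_Omega the Cholesky factor of the correlation matrix, Y an array of outcome vectors, X the predictors, and customer[n] the customer behind observation n):

parameters {
  // ... alpha, B and the rest of the model's parameters here
  vector<lower=0>[K] sigma;            // scale vector
  cholesky_factor_corr[K] L_Omega;     // Cholesky factor of the correlation matrix
}
transformed parameters {
  matrix[K, K] L_Sigma;
  // combine scales and correlations into the Cholesky factor of the covariance matrix
  L_Sigma = diag_pre_multiply(sigma, L_Omega);
}
model {
  // priors, for example:
  sigma ~ cauchy(0, 2);
  L_Omega ~ lkj_corr_cholesky(4);
  // ... priors on alpha and B here
  for(n in 1:N) {
    Y[n] ~ multi_normal_cholesky(alpha[customer[n]] + B[customer[n]] * X[n], L_Sigma);
  }
}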

What about when my outcome is constrained or not continuous?

We’ll talk about this more later, but there are two important rules of thumb when model building
  • Your model should place non-zero probabilistic weight on outcomes that do happen
  • Your model should not place weight on impossible outcomes.
If we have outcomes that are non-continuous, or constrained, a model with multivariate normal or Student-t errors will be severely mis-specified. So what should we do?
For constrained continuous outcomes $Y$, we can create a variable $Y^{*}$ that is on the unconstrained scale.
  • For outcomes with a lower bound $a$, we can define an unconstrained variable $Y^{*} = \log(Y - a)$ or $Y^{*} = \frac{(Y - a)^{\lambda} - 1}{\lambda}$ for some parameter $\lambda$ (a Box-Cox transformation).
  • If the variable has an upper bound $b$, we make the transformation as $Y^{*} = \log(b - Y)$ or the analogous Box-Cox transformation of $(b - Y)$.
  • For a series $Y$ constrained between a lower value $a$ and upper value $b$, we can create a new unconstrained variable $Y^{*} = \text{logit}\!\left(\frac{Y - a}{b - a}\right)$.
Once we have converted our constrained variable $Y$ into an unconstrained variable $Y^{*}$, we can include it in the outcome vector and model as above.
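As a small illustration of the doubly bounded case (the data names here are made up), the transformation can live in Stan's transformed data block:

data {
  int<lower=1> N;
  real a;                           // lower bound
  real<lower=a> b;                  // upper bound
  vector<lower=a, upper=b>[N] Y;    // the constrained outcome
}
transformed data {
  vector[N] Y_star;
  // map Y from (a, b) onto the whole real line
  Y_star = logit((Y - a) / (b - a));
}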
A similar issue occurs when we have a combination of continuous and binary or categorical outcomes. This is where the magic of Stan really comes to the fore. We can build transformed parameters that combine known data and unknown parameters into a single object, and then specify the joint probability of that object. An example might be illustrative.
Say we have a vector of continuous outcomes $Y_{t}$ for period $t$ (maybe you can think of this as being a vector of macroeconomic aggregates), and a single binary outcome $d_{t}$ (maybe an NBER recession period). Unfortunately we can’t include $d_{t}$ in a joint continuous model of $Y_{t}$, as joint distributions of discrete and continuous variables don’t exist in Stan. But what we can do is create a continuous parameter $\tau_{t}$ and append it to our continuous outcomes $Y_{t}$; $\tau_{t}$ then maps to the discrete outcome $d_{t}$ via a probit link function. That is, $\Pr(d_{t} = 1 \mid \tau_{t}) = \Phi(\tau_{t})$. In Stan:
// your data declarations (Y, d, etc.) here
parameters {
  vector[rows(Y)] tau;
  // ... declarations of A, B and some_covariance here
}
transformed parameters {
  matrix[rows(Y), cols(Y)+1] Y2;
  // fill in Y
  Y2[1:rows(Y), 1:cols(Y)] = Y;
  // fill in tau
  Y2[1:rows(Y), cols(Y)+1] = tau;
}
model {
  // priors
  // ...
  // our joint VAR(1) model of each row of Y2 (you should probably optimize this)
  for(t in 2:rows(Y)) {
    Y2[t]' ~ multi_normal(A + B * Y2[t-1]', some_covariance);
  }

  // the model of the binary outcome
  d ~ bernoulli(Phi_approx(col(Y2, cols(Y) + 1)));
}
By specifying the model this way, working out the conditional probability of an NBER-defined recession is simply a matter of simulating forward the entire model for each MCMC draw. This will incorporate uncertainty in the parameters, as well as the uncertainty in the future values of the predictors of a recession (provided a VAR(1) structure is correct—it probably isn’t).
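As a rough sketch, reusing the illustrative names from the block above, a one-step-ahead simulation could sit in the generated quantities block:

generated quantities {
  vector[cols(Y) + 1] y_next;
  real p_recession_next;
  // draw next period's augmented outcome vector, conditional on this MCMC draw
  y_next = multi_normal_rng(A + B * Y2[rows(Y)]', some_covariance);
  // the implied probability that next period is a recession
  p_recession_next = Phi_approx(y_next[cols(Y) + 1]);
}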

Aggregating data

The big advantage of working in industry is the access to incredibly granular data. For instance, the basic unit of data that we receive is a transaction, which is time-stamped. This is fake data, but it might look like this:
date | transaction_amount | customer_id | transaction_type
2016-08-11 05:05:29 | 102932 | 11613 | pmt
2016-08-14 20:50:29 | 102902 | 10277 | pmt
2016-08-25 12:06:52 | 115495 | 10613 | refund
2016-08-31 19:40:58 | 1082 | 38104 | pmt
2016-09-06 03:19:09 | 80990 | 66521 | pmt
These are incredibly rich data, and are quite typical of what is available in retail or finance. The problem is that they contain a lot of noise.
  • The timing between payments might be meaningful, but often contains no information about future payments.
  • The precise payment amount on a day might be meaningful, but if customers are invoiced weekly or monthly, we might be missing something. Many customers tend to pay in chunks (several small payments) rather than the monthly due amount, especially in frontier markets where many wages are paid daily.
Because of the noisiness of very granular data, we might more easily extract the signal by aggregating our data. But how much should we aggregate? Should we sum up all payments for each minute (across customers)? For each day or month? Or should we sum within customers across time periods? Of course, your problem will dictate how you might go about this.
One particularly good example of the power of aggregation is in financial markets. In the short run (but not the very short run) equity prices behave as though they’re very close to a random walk. Under this interpretation, the only way you’re going to beat the market is by getting superior information, or perhaps trading against “dumb money” in the market. Yet in the long run, there is an amazing amount of structure in equity prices—in particular, a long-run cointegrating relationship between earnings and prices. This famous observation comes from John Cochrane’s 1994 paper.


In the case of Lendable, we found very early on that aggregating beyond the customer-level was extremely dangerous. In particular, customers tend to “atrophy” in their repayment behaviour as time passes. Customers who’ve been with the firm for longer are more likely to miss payments and default than customers who’ve just signed on. At the same time, the firms we lend to are growing incredibly fast. This means that at any time, there are many new, enthusiastic customers in a portfolio, lowering aggregate loss rates. But these aggregate loss rates don’t actually give much of an insight into the economic viability of the lender, hence the need to model individual repayments.

Your data are not IID. Stan can help

When Lendable started our modeling of individual repayments for debt portfolios, we were using more traditional machine learning tools: trees, nets, regularized GLMs etc. We made the decision to switch over to Stan after realising that individual repayments were quite highly correlated with one another. Take two super-simple data generating processes.
The first assumes that shocks to repayment are uncorrelated across people

$$y_{it} = \mu + \epsilon_{it}, \quad \epsilon_{it} \sim \text{N}\left(0, \sigma_{\epsilon}^{2}\right)$$

The second assumes that every person receives a “time shock” and an idiosyncratic shock

$$y_{it} = \mu + \eta_{t} + \epsilon_{it}$$

with

$$\eta_{t} \sim \text{N}\left(0, \sigma_{\eta}^{2}\right)$$

and

$$\epsilon_{it} \sim \text{N}\left(0, \sigma_{\epsilon}^{2}\right)$$

If there are N customers, what is the variance of their cumulative payments under these two specifications?
Within a period, repayments under the first process are distributed

$$\sum_{i=1}^{N} y_{it} \sim \text{N}\left(N\mu,\; N\sigma_{\epsilon}^{2}\right)$$

which gives the aggregate payments between $t = 1$ and $t = T$ as

$$\sum_{t=1}^{T}\sum_{i=1}^{N} y_{it} \sim \text{N}\left(NT\mu,\; NT\sigma_{\epsilon}^{2}\right)$$

Now how about under the second DGP, in which repayments are correlated? Now we have

$$\sum_{t=1}^{T}\sum_{i=1}^{N} y_{it} \sim \text{N}\!\left(NT\mu,\; T\left(N^{2}\sigma_{\eta}^{2} + N\sigma_{\epsilon}^{2}\right)\right)$$

That is, under the first DGP, our uncertainty around future repayments is smaller than under the second DGP, by the ratio of the scales

$$\sqrt{\frac{N^{2}\sigma_{\eta}^{2} + N\sigma_{\epsilon}^{2}}{N\sigma_{\epsilon}^{2}}} = \sqrt{1 + N\,\frac{\sigma_{\eta}^{2}}{\sigma_{\epsilon}^{2}}}$$
A back-of-the-envelope rule of thumb we use is that the scale of portfolio shocks is roughly half the scale of individual shocks ($\sigma_{\eta} \approx \sigma_{\epsilon}/2$), meaning that methods which do not account for the correlation massively under-count observed volatility.
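For reference, the second DGP takes only a few lines of Stan; a minimal sketch, with illustrative names and priors:

data {
  int<lower=1> N;                    // customers
  int<lower=1> T;                    // periods
  matrix[N, T] y;                    // repayments
}
parameters {
  real mu;
  vector[T] eta;                     // common "time shock" in each period
  real<lower=0> sigma_eta;           // scale of portfolio-level shocks
  real<lower=0> sigma_epsilon;       // scale of idiosyncratic shocks
}
model {
  // weakly informative priors (illustrative)
  mu ~ normal(0, 1);
  sigma_eta ~ normal(0, 1);
  sigma_epsilon ~ normal(0, 1);
  // each period draws one shared shock...
  eta ~ normal(0, sigma_eta);
  // ...and every customer's repayment in that period moves with it
  for (t in 1:T)
    col(y, t) ~ normal(mu + eta[t], sigma_epsilon);
}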

Now the models we use at Lendable aren’t quite as simple as this example, but we adhere to the general gist, modeling portfolio shocks separately from idiosyncratic shocks. The problem we found before migrating to Stan was that essentially all black-box machine learning techniques assume conditionally iid data (à la the first model), and so our predictive densities were far too confident. If you’re making pricing decisions based on predictive densities, this is highly undesirable—you are not pricing in known risks, let alone unknown risks.
After realising that essentially no typical machine learning methods were able to take into account the sorts of known risks we wanted to model, we started looking for solutions. One early contender was the lme4 package in R, which estimates basic mixed models (like model 2 above) using an empirical-Bayes-flavoured approach. The issue with this was that as the dimension of the random effects gets large, it tends to underestimate the group-level scale $\sigma_{\eta}$, which kind of defeats the purpose.
The other issue was that standard mixed models assume that the data are conditionally distributed according to some nice, parametric distribution. When you’re modeling loan repayments, this is not what your data look like.

Thinking generatively affects how you think about the model specification

The mindset for many in the “data science” scene, especially in finance, is “how can I use machine learning tools to discover structure in my data?” I’d caution against this approach, advocating instead a mindset of “what structure might give rise to the data that I observe?” Lendable’s experience in modeling small loans in frontier markets suggests that this latter approach can yield very good predictions, interpretable results, and a mature approach to causal inference.
When building a generative model, it’s wise to first plot the data you want to model. For instance, weekly repayments on loans might have a density that looks like the black line in the following chart:


You can see that the density is multi-modal. There’s a clear spike at 0—these loans have no fixed repayment schedule, so it’s quite common for people to pre-pay a big chunk in one month, then not pay in future months, or to not pay during periods in which they don’t have much money (say, between harvests). There are also two different modes, and a very long right-hand tail.
When you look at a density like this, you can be pretty sure that it’ll be impossible to find covariates such that the residuals are normal. Instead, you have to do some deeper generativist reasoning. What this density suggests is that there are people paying zero (for whatever reason), some others paying at the modes—these folks are actually paying the “due amount”—and some other people paying much more than the due amount.
One extremely useful technique is to decompose the problem into two nested problems: the first is a model of which group each observation belongs to, the second is the predicted payment conditional on the group it’s in. This isn’t my idea; Ben Goodrich taught it to me, and it has been written about in Rachael Meager’s excellent job-market paper.
So how do you go about implementing such a model in Stan? The basic idea is that you have two dependent variables: one tells us the “class” that the observation belongs to (expressed as an integer), and the other the actual outcome.
data {
  int<lower=1> N;
  // ... other data here
  int<lower=1> class_id[N];   // the class of each observation ("class" itself is a reserved word in Stan)
  vector[N] outcome;
}
// ... parameters (including mean_vector) here
model {
  // priors here

  // model of class membership
  class_id ~ categorical(softmax(mean_vector));

  // model of the outcome, conditional on class
  for(n in 1:N) {
    if(class_id[n] == 1) {
      outcome[n] ~ my_model_for_class_1();
    } else if(class_id[n] == 2) {
      outcome[n] ~ my_model_for_class_2();
    }
    // ...
  }
}
Of course we could implement something like this outside of Stan, by modeling classes and conditional outcomes separately, but if we’re using random effects in each model and want to estimate their correlation (yes, we want to do both of these things!) we have to use Stan.

Treatments affect a distribution, not just its expected value

A big part of the modeling in industry is evaluating the impact of interventions using randomized control trials (in industry, known as A/B tests). Sadly, it is common to use very basic estimates of causal impacts from these tests (like difference in means, perhaps “controlling” for covariates). These commonly used techniques throw away a lot of information. A treatment might affect the expected value of the outcome, but is that all we’re interested in?
Using the example of small loan repayments: what if a lender’s call-center management affects repayments, but mainly by changing the number of people paying zero in a given month? Or perhaps the treatment affects customers’ probability of making a large payment? Or maybe it convinces those who were going to make a large payment already to make a larger payment? Building a generative model like the one discussed above and including an experimental treatment in each sub-model allows us to not only get more precise estimates of the treatment effect, but also to understand how the causal impact of a treatment affects the outcome.
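A minimal sketch of what that could look like (with made-up variable names, and a simple two-part model standing in for the richer mixture above): the treatment enters both the model for whether a customer pays at all and the model for how much they pay.

data {
  int<lower=1> N;
  int<lower=0, upper=1> paid[N];     // did the customer pay anything this month?
  vector[N] log_amount;              // log payment; only used when paid[n] == 1
  vector[N] treatment;               // 1 if treated, 0 if control
}
parameters {
  real alpha_paid;
  real beta_paid;                    // treatment effect on the probability of paying at all
  real alpha_amount;
  real beta_amount;                  // treatment effect on the (log) amount paid
  real<lower=0> sigma;
}
model {
  // weakly informative priors (illustrative)
  alpha_paid ~ normal(0, 1);
  beta_paid ~ normal(0, 1);
  alpha_amount ~ normal(0, 1);
  beta_amount ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // the treatment can move the share of customers paying zero...
  paid ~ bernoulli_logit(alpha_paid + beta_paid * treatment);
  // ...and, separately, the size of payments among those who pay
  for(n in 1:N) {
    if(paid[n] == 1)
      log_amount[n] ~ normal(alpha_amount + beta_amount * treatment[n], sigma);
  }
}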

Conclusion

At Lendable we care very much about making investment decisions given uncertainty. Models will tend to not include risks that haven’t taken place in the historical data. But we see models all the time that don’t even take into account the information we do have in the data, let alone any of these Black Swans. Stan really makes it quite easy to build models that do use that information. If you’re keen on starting to learn, then get in touch, and I’ll give you a project!
I have many.