Modern Statistical Workflow: August 2017

A fixture in online discussion of data science is the causal inference vs predictive analytics debate. Where should a bright young person be focusing their efforts? Here’s Francis Diebold. Here’s Yanir Seroussi. There are dozens more.

What is often missing from these debates is another important sub-field: modeling as measurement. I’ll get to that in a sec, but let me first describe how I think data analysis (as catholic a grouping as I can imagine) can add value.

A hierarchy of value add

Organizations distinguish themselves by strategy. Among firms of similar age with similar-quality operations, some are far more successful than others; think of smartphone makers or fast-food outlets. They can’t be totally inept at operations, but high-quality operations don’t seem to be necessary to make a mountain of money. If you’ve called a telecom recently, you know this to be true.

So how do various fields of data science map to the value-driving functions of an organization? A crass simplification, sure, but to my mind it looks like this:

Sub-field	Operational/tactical/strategic	Difficulty
Business intelligence	Tactical/strategic	Easy
Predictive modeling	Operational/tactical	Easy-difficult
Experimental causal inference	Tactical	Moderately difficult
Structural causal inference	Strategic	Extremely difficult
Modeling-as-measurement	Strategic	Moderately difficult

By “business intelligence” I mean “writing smart database queries/manual data entry and making pretty plots”. The barriers to entry in this field are pretty low—especially since the advent of R’s tidyverse and the Tableau BI platform—and the potential upside, letting senior decision-makers understand what’s going on, is enormous. Many intelligent folks snobbishly look down on this field, and they’re stupid to do so. People running companies and countries do not look down on BI.

I’ll skip over predictive modeling, which you know all about. By experimental causal inference I mean “A/B testing”, as it is known in industry, or randomized controlled trials in science. At first glance it mightn’t seem to be that difficult. Yet there are subtleties. Sometimes your treatment group doesn’t take the treatment, and control-group folks seek out the treatment. You get big selection-into-study effects, etc. So doing it properly is pretty tough. Causal inference does give us the promise of being able to make decisions based on some notion of science, improving management decisions. Yet most applications are very much tactical (“which demographic groups should we target for this campaign?”, “what aid interventions actually affect schooling levels?”). Of course, getting tactics right is important, but tactics are a small part of a big picture. Singapore and South Korea didn’t get rich by running RCTs on every policy.
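To make the non-compliance subtlety concrete, here is a minimal simulated sketch. Everything in it is an assumption chosen for illustration: the function names, the take-up thresholds, the unobserved "motivation" variable `u`, and the true effect of 2.0. It compares three estimates: the intent-to-treat difference (by assignment), a naive as-treated comparison (by who actually took the treatment), and the Wald (instrumental-variables) ratio, which is one standard correction when assignment is random but take-up is not.

```python
import numpy as np

def simulate_ab_test(n=200_000, true_effect=2.0, seed=0):
    """Simulate an A/B test where take-up depends on an unobserved trait."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)   # random assignment: 1 = treatment arm
    u = rng.normal(size=n)           # unobserved motivation (hypothetical)
    # Roughly 60% of the treatment arm takes the treatment; roughly 10% of
    # the control arm seeks it out. In both arms the motivated people take it.
    d = np.where(z == 1, u > -0.25, u > 1.28).astype(float)
    # Motivation raises the outcome on its own, which biases naive comparisons.
    y = u + true_effect * d + rng.normal(size=n)
    return z, d, y

def estimate_effects(z, d, y):
    """Return intent-to-treat, as-treated, and Wald estimates."""
    itt = y[z == 1].mean() - y[z == 0].mean()
    takeup_gap = d[z == 1].mean() - d[z == 0].mean()
    as_treated = y[d == 1].mean() - y[d == 0].mean()
    return {"itt": itt, "as_treated": as_treated, "wald": itt / takeup_gap}

z, d, y = simulate_ab_test()
for name, value in estimate_effects(z, d, y).items():
    print(f"{name}: {value:.2f}")
```

In this setup the intent-to-treat estimate is diluted (about half the arm difference in take-up is missing), the as-treated comparison is inflated by selection, and the Wald ratio lands near the true effect. That last result leans on the simulation's constant treatment effect; with real data it estimates the effect only for people whose take-up responds to assignment.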

Structural modeling (I’ve called it “structural causal inference” above) is when you try to identify deep parameters of decision-makers’ optimization problems (like risk aversion or discount rate or preference structure). The notion is that if you’ve got a good model of how people and companies make decisions, then you can draw inference about how they might behave in situations that are quite dissimilar to history, and respond accordingly. That’s almost by definition what strategy is. Structural modeling can plausibly add a lot of value. Yet you have to be a certified genius to understand it, and none of the many excellent structuralists I know would be able to sell any of their (excellent) ideas to a board of non-specialists. Time spent learning how to impress other structuralists seems to come at a 1:1 cost of time spent learning how to influence anyone else.

If you are smart, you prioritize. Spending cognitive resources on operational-level stuff just seems a waste. All the easy questions are (or will be) automated, and the hard questions—natural language and image processing—are akin to the difference between different models of Samsung cellphones, not the difference between Apple and Samsung. You want to get people who make important decisions to make good decisions.

One way to do this is by beefing up on good BI skills, but low barriers to entry decrease the wages and prestige in doing this. If you’re very bright, you might consider becoming a structuralist. The danger in doing that is in falling in love with the models rather than the world. Or you could learn modeling as measurement.

Modeling as measurement

Almost none of the data I work with maps cleanly to the things I care about. For context, the data I work with is mostly mobile money payments made to asset-backed lenders in East Africa. What I care about is whether their portfolio of loans is getting better or worse (we lend them money as a wholesale lender). This is actually a difficult problem, for a few reasons.

The probability of a loan turning sour changes over the life of the loan. It’s quite typical that borrowers make higher payments at the beginning of the loan, but the relationship is non-linear, and varies across individual borrowers (whose probabilities are almost certainly correlated).
The lenders who we lend to are growing very quickly. Each month they have more new, eager borrowers who make good payments (for the first few months at least).
Lenders often introduce new products, different loan terms, occasionally have product defects/recalls, expand into new geographies with less-experienced management etc.

The combination of all these factors makes it surprisingly difficult to work out whether a lender’s portfolio is improving or not. The typical BI answer would be to break the problem down into sub-problems, say, by analyzing different cohorts or different products or different geographies. Banks typically do this by coming up with Key Performance Indicators (KPIs) that they can quickly calculate for sub-groups. This is a helpful approach. Yet it leaves a lot on the table. To get the most from your data, you have to build a model.
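A toy calculation shows how growth alone can hide deterioration in a headline KPI. This is only a sketch with invented numbers, not the author's data or model: the function names, the 12-month loan term, the arrears curve rising with loan age, the 3% per-cohort worsening, and the jump to 30% monthly growth after month 12 are all assumptions.

```python
def cohort_size(cohort, growth_start=12, growth=0.30, base_size=100.0):
    """Loans originated in a given month; growth kicks in after growth_start."""
    return base_size * (1 + growth) ** max(0, cohort - growth_start)

def arrears_rate(age, cohort, worsening=0.03):
    """Share of a cohort in arrears: rises with loan age, and each newer
    cohort is a bit worse than the last at every age."""
    return min(1.0, 0.02 * age * (1 + worsening) ** cohort)

def portfolio_arrears_share(month, term=12):
    """Headline KPI: share of all live loans in arrears in a given month."""
    live_cohorts = range(max(0, month - term + 1), month + 1)
    total = sum(cohort_size(c) for c in live_cohorts)
    in_arrears = sum(cohort_size(c) * arrears_rate(month - c, c)
                     for c in live_cohorts)
    return in_arrears / total

for month in (12, 18):
    print(month, round(portfolio_arrears_share(month), 3))
```

Between months 12 and 18 every new cohort is worse than the one before it at every loan age, yet the portfolio-level share in arrears falls, because the book fills up with young loans that have not had time to go bad. Cohort breakdowns catch this particular effect; a model is what lets you separate it from the other factors at the same time.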

In the “modeling as measurement” world, you’re not really doing prediction or causal inference—though you should be building models that can do both. You are building models of your observed variables as a function of the unobserved variables that you actually care about. If you build a good model of your observed data as a function of the unknowns, then it’s fairly straightforward to use the model to make predictions of the way things actually are, abstracting from the measurement problems in your data. That sort of information is profoundly useful to decision-makers.

A good example of this is a state-space model, often described using the application of the Kalman filter to space flight. Picture a spacecraft zooming to the moon. You want to work out its location in some space and its velocity, and tell it where to turn (if necessary). How do you work out where the spacecraft is? One way would be to point two or more telescopes at the spacecraft at the same time, and use trigonometry. The problem with this is that it returns only an imprecise estimate of the location of the spacecraft. Can we do better?

The genius of the state-space approach is that we can combine sources of information. First, we have a sequence of noisy estimates of the location of the spacecraft. Second, we have the knowledge that, undisturbed, a spacecraft zooming through space won’t be making any turns. We can combine these two sources of information (using the Kalman filter or more modern approaches) in a way that results in far less uncertainty about the (true, unobserved) location and velocity of the spacecraft. That’s profoundly useful, and this is before we even consider the potential use of state-space models for prediction or (typically structural) causal inference.
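The spacecraft story can be sketched in a few lines. This is a minimal illustration under simplifying assumptions that are mine, not the post's: the craft moves along one dimension, only position is observed, the noise levels are made up, and the function names are hypothetical. The filter alternates between a prediction from the "no turns" dynamics and a correction from each noisy reading.

```python
import numpy as np

def kalman_constant_velocity(observations, dt=1.0, obs_sd=5.0, accel_sd=0.1):
    """Filter noisy 1-D position readings under a constant-velocity model.

    Returns an array with one [position, velocity] estimate per reading.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # position += velocity * dt
    H = np.array([[1.0, 0.0]])              # we only observe position
    G = np.array([[0.5 * dt**2], [dt]])     # how a small acceleration enters
    Q = accel_sd**2 * (G @ G.T)             # process noise ("almost no turns")
    R = np.array([[obs_sd**2]])             # measurement noise

    x = np.zeros(2)                         # vague starting guess
    P = np.eye(2) * 1e4                     # with very wide uncertainty

    estimates = []
    for y in observations:
        # Predict: where physics says the craft should be now.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update: pull the prediction toward the noisy reading.
        innovation = y - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ innovation
        P = (np.eye(2) - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)

def simulate_track(n_steps=100, velocity=2.0, obs_sd=5.0, seed=1):
    """True straight-line positions plus noisy 'telescope' readings."""
    rng = np.random.default_rng(seed)
    true_position = velocity * np.arange(n_steps)
    observations = true_position + rng.normal(0.0, obs_sd, size=n_steps)
    return true_position, observations

true_position, observations = simulate_track()
estimates = kalman_constant_velocity(observations)
print("final [position, velocity] estimate:", estimates[-1])
```

Note that velocity is never measured directly here; the filter infers it from the sequence of positions, which is the "modeling as measurement" idea in miniature. Once the filter has seen a handful of readings, its position estimates should track the truth more closely than the raw readings do.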

The thing that I find strange about this is that when I talk to investors and other decision-makers, they want to know how things are. They don’t trust forecasts—probably a good rule of thumb. And for executive decisions they’re definitely not interested in treatment effects. They want an unbiased description of reality. My bet is that building skills in descriptive modeling—modeling as measurement—is profoundly useful, and will be seen to be in the coming years.

Sunday, August 6, 2017

Modeling as measurement

Jim Savage