Sunday, August 6, 2017

Modeling as measurement

A fixture in online discussion of data science is the causal inference vs predictive analytics debate. Where should a bright young person be focusing their efforts? Here’s Francis Diebold. Here’s Yanir Seroussi. There are dozens more.
What is often missing from these debates is another important sub-field: modeling as measurement. I’ll get to that in a sec, but let me first describe how I think about how data analysis (as catholic a grouping as I can imagine) can add value.
A hierarchy of value add
Organizations distinguish themselves by strategy. We see firms of similar ages with similar-quality operations, with some firms being far more successful than others. Smart-phones, fast-food outlets etc. They can’t be totally inept at operations, but high-quality operations don’t seem to be necessary to make a mountain of money. If you’ve called a telecom recently, you know this to be true.
So how do various fields of data science map to the value-driving functions of an organization? A crass simplification, sure, but to my mind it looks like this:
Sub-field                       Operational/tactical/strategic   Difficulty
------------------------------  -------------------------------  --------------------
Business intelligence           Tactical/strategic               Easy
Predictive modeling             Operational/tactical             Easy to difficult
Experimental causal inference   Tactical                         Moderately difficult
Structural causal inference     Strategic                        Extremely difficult
Modeling-as-measurement         Strategic                        Moderately difficult
By “business intelligence” I mean “writing smart database queries/manual data entry and making pretty plots”. The barriers to entry in this field are pretty low—especially since the advent of R’s tidyverse and the Tableau BI platform—and the potential upside, letting senior decision-makers understand what’s going on, is enormous. Many intelligent folks snobbishly look down on this field, and they’re stupid to do so. People running companies and countries do not look down on BI.
I’ll skip over predictive modeling, which you know all about. By experimental causal inference I mean “A/B” testing, as it is known in industry, or randomized controlled trials in science. At first glance it mightn’t seem to be that difficult. Yet there are subtleties. Sometimes your treatment group doesn’t take the treatment, and control-group folks seek out the treatment. You get big selection-into-study effects, etc. So doing it properly is pretty tough. Causal inference does give us the promise of being able to make decisions based on some notion of science, improving management decisions. Yet most applications are very much tactical (“which demographic groups should we target for this campaign?”, “what aid interventions actually affect schooling levels?”). Of course, getting tactics right is important, but tactics are a small part of a big picture. Singapore and South Korea didn’t get rich by running RCTs on every policy.
Structural modeling (I’ve called it “structural causal inference” above) is when you try to identify deep parameters of decision-makers’ optimization problems (like risk aversion or discount rate or preference structure). The notion is that if you’ve got a good model of how people and companies make decisions, then you can draw inference about how they might behave in situations that are quite dissimilar to history, and respond accordingly. That’s almost by definition what strategy is. Structural modeling can plausibly add a lot of value. Yet you have to be a certified genius to understand it, and none of the many excellent structuralists I know would be able to sell any of their (excellent) ideas to a board of non-specialists. Time spent learning how to impress other structuralists seems to come at a 1:1 cost of time spent learning how to influence anyone else.
If you are smart, you prioritize. Spending cognitive resources on operational-level stuff just seems a waste. All the easy questions are (or will be) automated, and the hard questions—natural language and image processing—are akin to the difference between different models of Samsung cellphones, not the difference between Apple and Samsung. You want to get people who make important decisions to make good decisions.
One way to do this is by beefing up on good BI skills, but low barriers to entry decrease the wages and prestige in doing this. If you’re very bright, you might consider becoming a structuralist. The danger in doing that is in falling in love with the models rather than the world. Or you could learn modeling as measurement.
Modeling as measurement
Almost none of the data I work with maps cleanly to the things I care about. For context, the data I work with is mostly mobile money payments made to asset-backed lenders in East Africa. What I care about is whether their portfolio of loans is getting better or worse (we lend them money as a wholesale lender). This is actually a difficult problem.
  • The probability of a loan turning sour changes over the life of the loan. It’s quite typical that borrowers make higher payments at the beginning of the loan, but the relationship is non-linear, and varies across individual borrowers (whose probabilities are almost certainly correlated).
  • The lenders who we lend to are growing very quickly. Each month they have more new, eager borrowers who make good payments (for the first few months at least).
  • Lenders often introduce new products, different loan terms, occasionally have product defects/recalls, expand into new geographies with less-experienced management etc.
The combination of all these factors makes it surprisingly difficult to work out whether a lender’s portfolio is improving or not. The typical BI answer would be to break the problem down into sub-problems, say, by analyzing different cohorts or different products or different geographies. Banks typically do this by coming up with Key Performance Indicators (KPIs) that they can quickly calculate for sub-groups. This is a helpful approach. Yet it leaves a lot on the table. To get the most from your data, you have to build a model.
In the “modeling as measurement” world, you’re not really doing prediction or causal inference—though you should be building models that can do both. You are building models of your observed variables as a function of the unobserved variables that you actually care about. If you build a good model of your observed data as a function of the unknowns, then it’s fairly straightforward to use the model to make predictions of the way things actually are, abstracting from the measurement problems in your data. That sort of information is profoundly useful to decision-makers.
A good example of this is a state-space model, often described using the application of the Kalman filter to space flight. Picture a spacecraft zooming to the moon. You want to work out its location in some space and its velocity, and tell it where to turn (if necessary). How do you work out where the spacecraft is? One way would be to point two or more telescopes at the spacecraft at the same time, and use trigonometry. The problem with this is that it returns only an imprecise estimate of the location of the spacecraft. Can we do better?
The genius of the state-space approach is that we can combine sources of information. First, we have a sequence of noisy estimates of the location of the spacecraft. Second, we have the knowledge that, undisturbed, a spacecraft zooming through space won’t be making any turns. We can combine these two sources of information (using the Kalman filter or more modern approaches) in a way that results in far less uncertainty about the (true, unobserved) location and velocity of the spacecraft. That’s profoundly useful, and this is before we even consider the potential use of state-space models for prediction or (typically structural) causal inference.
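To make the spacecraft example concrete, here is a minimal sketch of a one-dimensional constant-velocity Kalman filter in plain Python. Everything specific in it is an assumption for illustration: the function name `kalman_cv_1d`, the noise levels, the time step and the simulated trajectory are made up, and a real tracking problem would have more dimensions and a more careful noise model.

```python
import math
import random


def kalman_cv_1d(observations, dt=1.0, meas_sd=5.0, accel_sd=0.01):
    """Filter noisy position readings under a constant-velocity model.

    Returns one (position, velocity, position_variance) tuple per reading.
    """
    r = meas_sd ** 2   # measurement noise variance
    q = accel_sd ** 2  # variance of random accelerations (process noise)
    pos, vel = 0.0, 0.0            # vague starting guess for the state
    p11, p12, p22 = 1e6, 0.0, 1e6  # state covariance [[p11, p12], [p12, p22]]
    estimates = []
    for y in observations:
        # Update: weigh the new reading against the model's prediction.
        s = p11 + r
        k1, k2 = p11 / s, p12 / s  # Kalman gain for position and velocity
        resid = y - pos
        pos, vel = pos + k1 * resid, vel + k2 * resid
        p11, p12, p22 = (1 - k1) * p11, (1 - k1) * p12, p22 - k2 * p12
        estimates.append((pos, vel, p11))
        # Predict: undisturbed, the craft keeps moving in a straight line.
        pos = pos + dt * vel
        p11, p12, p22 = (
            p11 + 2 * dt * p12 + dt ** 2 * p22 + q * dt ** 4 / 4,
            p12 + dt * p22 + q * dt ** 3 / 2,
            p22 + q * dt ** 2,
        )
    return estimates


def rmse(estimates, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


# Simulate a craft moving at a constant 2 units per step, observed with noise.
random.seed(1)
true_pos = [2.0 * t for t in range(200)]
readings = [x + random.gauss(0.0, 5.0) for x in true_pos]
filtered = kalman_cv_1d(readings)

# Compare errors over the second half, once the filter has settled.
raw_rmse = rmse(readings[100:], true_pos[100:])
filtered_rmse = rmse([e[0] for e in filtered[100:]], true_pos[100:])
print(f"raw RMSE: {raw_rmse:.2f}, filtered RMSE: {filtered_rmse:.2f}")
```

The filtered positions should track the true path more closely than the raw readings do, because each estimate borrows strength from the "no turns" assumption. The filter also returns a velocity estimate, even though velocity is never observed directly.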
The thing that I find strange about this is that when I talk to investors and other decision-makers, they want to know how things are. They don’t trust forecasts—probably a good rule of thumb. And for executive decisions they’re definitely not interested in treatment effects. They want an unbiased description of reality. My bet is that building skills in descriptive modeling—modeling as measurement—is profoundly useful, and will be seen to be in the coming years.

Sunday, July 16, 2017

A few simple reparameterizations

It’s often convenient to make use of reparameterizations in our models. Often we want to be able to express parameters in a way that makes coming up with a prior more intuitive. Other times we want to use reparameterizations to improve sampling performance during estimation. In any case, the below are some very simple reparameterizations that we often use in applied modeling.
Reparameterizing a univariate normal random variable
Sometimes we have a normally distributed unknown $\theta$ with mean (location) $\mu$ and standard deviation (scale) $\sigma$. We write this as

$$\theta \sim \text{normal}(\mu, \sigma)$$

If we have information about the expected mean and standard deviation for data at the level of the observation, $X$, we could happily incorporate this information like so:

$$\theta \sim \text{normal}(f(X), g(X))$$

For instance, a normal linear model has $f(X) = X\beta$ and $g(X) = \sigma$, but we could use a variety of functional forms for both.
In such a case, we can always reparameterize $\theta$ as

$$\theta = \mu + \sigma z, \quad z \sim \text{normal}(0, 1)$$

where $\mu = f(X)$ and $\sigma = g(X)$ when the location and scale depend on the data.
In Stan, we’d implement this by declaring $z$ in the parameters block, and $\theta$ in the transformed parameters block. We then provide distributional information about $z$ in the model block but typically use $\theta$ in the likelihood. For example:


// ... Your data declaration here
parameters {
  vector[N] z;
  // parameters of f() and g()
}
transformed parameters {
  vector[N] theta;
  theta = f(X) + g(X)*z; // f() and g() are vector valued functions that probably have parameters
}
model {
  // priors
  z ~ normal(0, 1);
  
  
  // likelihood
  // ..
}
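A quick way to convince yourself that the shift-and-scale trick gives the same distribution is to simulate it. The sketch below is plain Python rather than Stan, and the values of `mu` and `sigma` are hypothetical; it only checks that direct draws and reparameterized draws have matching sample moments.

```python
import random
import statistics

random.seed(42)
mu, sigma = 3.0, 2.0  # hypothetical location and scale
n_draws = 50_000

# Centered: draw theta directly from normal(mu, sigma).
theta_direct = [random.gauss(mu, sigma) for _ in range(n_draws)]

# Non-centered: draw a standard normal z, then shift and scale it.
z = [random.gauss(0.0, 1.0) for _ in range(n_draws)]
theta_reparam = [mu + sigma * z_i for z_i in z]

# Both sets of draws should have a mean near mu and a standard deviation near sigma.
for name, draws in [("direct", theta_direct), ("reparameterized", theta_reparam)]:
    print(name, round(statistics.mean(draws), 2), round(statistics.stdev(draws), 2))
```

The two versions describe the same distribution for theta. What changes is the quantity the sampler works with: in the second version it explores z, which has unit scale whatever mu and sigma turn out to be.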
Reparameterizing a covariance matrix
I always found covariance matrices tricky to think about until I realized how easily they can be reparameterized. We all know how to think about the standard deviation of a random variable. If the growth rate of GDP has a standard deviation of 1%, then that’s very intuitive. How about a vector of random variables? If the vector is (GDP, Unemployment, Rainfall) then each of those random variables has its own standard deviation. Let’s call the vector of these standard deviations $\tau$. Easy!
Now those random variables might move together. We typically measure the (linear) dependence between the elements of a vector of random variables using a correlation matrix, $\Omega$. The diagonal of a correlation matrix is 1 (all variables are perfectly correlated with themselves), and the off-diagonals are between -1 and 1, each giving the correlation coefficient between a pair of random variables. It is symmetric. For instance, the element $\Omega_{1,3}$ is the correlation between GDP and rainfall.
We have everything we need now. The covariance matrix $\Sigma$ is simply:

$$\Sigma = \text{diag}(\tau)\, \Omega\, \text{diag}(\tau)$$

The cool thing about this parameterization is that often our likelihood calls for a covariance matrix (for instance, if we were jointly modeling the three random variables), but we find it easier to provide prior information about the (marginal) scale and correlation between the variables.

We’d implement this in Stan by declaring $\tau$ and $\Omega$ as parameters, then $\Sigma$ as a transformed parameter. We then provide priors for $\tau$ and $\Omega$, but use $\Sigma$ in the likelihood. We often use the LKJ distribution as a prior for correlation matrices.
parameters {
  vector<lower = 0>[3] tau;
  corr_matrix[3] Omega;
}
transformed parameters {
  matrix[3, 3] Sigma;
  Sigma = quad_form_diag(Omega, tau); // same as diag_matrix(tau) * Omega * diag_matrix(tau)
}
model {
  // priors
  tau ~ student_t(3, 0, 2);
  Omega ~ lkj_corr(4);
  
  // likelihood
  // expression involving Sigma
}
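The scale-and-correlation decomposition is easy to check by hand. Here is a small sketch in plain Python, with made-up scales and correlations for (GDP, Unemployment, Rainfall), that builds the covariance matrix from its two pieces and then recovers the pieces again.

```python
import math

tau = [1.0, 0.5, 20.0]  # hypothetical scales for (GDP, Unemployment, Rainfall)
Omega = [               # hypothetical correlation matrix
    [1.0, -0.6, 0.2],
    [-0.6, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]
K = len(tau)

# Sigma = diag(tau) * Omega * diag(tau), written out element by element.
Sigma = [[tau[i] * Omega[i][j] * tau[j] for j in range(K)] for i in range(K)]

# Going back: scales are the square roots of the diagonal,
# and correlations are covariances divided by the two scales.
tau_back = [math.sqrt(Sigma[i][i]) for i in range(K)]
Omega_back = [
    [Sigma[i][j] / (tau_back[i] * tau_back[j]) for j in range(K)]
    for i in range(K)
]

for row in Sigma:
    print(row)
```

The diagonal of `Sigma` holds the variances (the squared scales), and each off-diagonal is a correlation multiplied by the two relevant scales. That is why priors on the scales and the correlations are enough to pin down a prior on the covariance matrix.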
Reparameterizing multivariate normals
You’ll notice in the reparameterization $\theta = \mu + \sigma z$ for some $z \sim \text{normal}(0, 1)$ that $\sigma$ is the square root of the variance of $\theta$. Multivariate normal distributions are typically parameterized in terms of their variance-covariance matrix, which is the analog to the variance of a univariate normal. But if we want to apply our intuition from the above reparameterization, we need the “square root” of this covariance matrix.
There are many such “square roots” of positive definite matrices; one is the Cholesky factorization

$$\Sigma = L L^\top$$

where $L$ is a lower triangular matrix with the same dimensions as $\Sigma$. If we have such an $L$, we can very easily apply the reparameterization at the top. For some vector $\theta$, if

$$\theta \sim \text{multi normal}(\mu, \Sigma)$$

then we can also say that

$$\theta = \mu + L z, \quad z_k \sim \text{normal}(0, 1)$$

Another convenient take on this is to use the fact that if

$$\Sigma = \text{diag}(\tau)\, L_\Omega L_\Omega^\top\, \text{diag}(\tau)$$

where $L_\Omega$ is the Cholesky factor of the correlation matrix, then we can use the parameterization

$$\theta = \mu + \text{diag}(\tau)\, L_\Omega z$$

This parameterization requires less fiddling than the one above. Stan also gives us an LKJ prior distribution for the Cholesky factors of correlation matrices.
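Why the Cholesky factor works as a “square root” can be checked in a couple of lines. This is a sketch that assumes the elements of $z$ are independent standard normals, so that $E[z] = 0$ and $E[z z^\top] = I$, and that writes $\Omega$ for the correlation matrix and $L_\Omega$ for its Cholesky factor:

```latex
\begin{align*}
E[\mu + L z] &= \mu + L\, E[z] = \mu \\
\mathrm{Cov}(\mu + L z) &= E\left[(L z)(L z)^\top\right]
  = L\, E[z z^\top]\, L^\top = L L^\top = \Sigma \\
\text{and with } L = \mathrm{diag}(\tau)\, L_\Omega: \quad
L L^\top &= \mathrm{diag}(\tau)\, L_\Omega L_\Omega^\top\, \mathrm{diag}(\tau)
  = \mathrm{diag}(\tau)\, \Omega\, \mathrm{diag}(\tau) = \Sigma
\end{align*}
```

A linear transformation of a multivariate normal is itself multivariate normal, so matching the mean and the covariance is enough to match the whole distribution.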
In Stan, we implement this by declaring $\mu$, $\tau$, $z$ and $L_\Omega$ as parameters, and $\Theta$ as a transformed parameter.
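parameters {
  vector[K] mu;                     // location of each of the K variables
  vector<lower = 0>[K] tau;         // marginal scales
  array[N] vector[K] z;             // N vectors of K standard normals (array syntax needs Stan >= 2.26)
  cholesky_factor_corr[K] L_Omega;  // Cholesky factor of the correlation matrix
}
transformed parameters {
  matrix[K, K] L;
  array[N] vector[K] Theta;         // one K-vector per observation, matching z
  L = diag_pre_multiply(tau, L_Omega); // Cholesky factor of the covariance matrix

  for(n in 1:N) {
    Theta[n] = mu + L * z[n];
  }
}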
model {
  mu ~ normal(0, 1);
  tau ~ student_t(3, 0, 2);
  for(n in 1:N) {
    z[n] ~ normal(0, 1);
  }
  L_Omega ~ lkj_corr_cholesky(4);
  
  // likelihood below, depending on Theta
}
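To see that scaling the Cholesky factor of the correlation matrix really does give a “square root” of the covariance matrix, here is a sketch in plain Python. The `cholesky` helper is a textbook implementation written for this example, and the values of `mu`, `tau` and `Omega` are hypothetical.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular low with low * low^T == a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


mu = [0.0, 1.0, -1.0]  # hypothetical locations
tau = [1.0, 0.5, 2.0]  # hypothetical scales
Omega = [[1.0, -0.6, 0.2], [-0.6, 1.0, 0.1], [0.2, 0.1, 1.0]]
K = len(mu)
Sigma = [[tau[i] * Omega[i][j] * tau[j] for j in range(K)] for i in range(K)]

# Route 1: factor the covariance matrix directly.
L_direct = cholesky(Sigma)
# Route 2: factor the correlation matrix, then scale row i by tau[i],
# which is what Stan's diag_pre_multiply(tau, L_Omega) computes.
L_Omega = cholesky(Omega)
L = [[tau[i] * L_Omega[i][j] for j in range(K)] for i in range(K)]

# L * L^T, to compare against Sigma.
LLt = [[sum(L[i][k] * L[j][k] for k in range(K)) for j in range(K)] for i in range(K)]


def draw_theta(rng):
    """One draw of theta = mu + L z, with z a vector of standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    return [mu[i] + sum(L[i][j] * z[j] for j in range(K)) for i in range(K)]


rng = random.Random(7)
draws = [draw_theta(rng) for _ in range(20_000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(K)]
cov_01 = sum((d[0] - means[0]) * (d[1] - means[1]) for d in draws) / (len(draws) - 1)
print("sample cov(theta_1, theta_2):", round(cov_01, 3), "target:", Sigma[0][1])
```

Both routes should produce the same lower-triangular factor. The sample covariance of the simulated draws should land close to the matching entry of `Sigma`, up to simulation noise.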
There are of course many other reparameterizations that we use in building models, but I tend to use these three daily.