Monday, November 14, 2016

A bag of tips and tricks for dealing with scale issues

Often when building models we come across scale issues. Some common examples:
  • Some columns of our data might be on different orders of magnitude from others; for instance, gender might be encoded as a binary variable, while income might be stored in whole-dollar amounts.
  • An outcome might be extremely rare.
  • The impact of variable X1 on Y could be orders of magnitude greater than the impact of variable X2. That is, it’s not the scale of the data that is causing issues so much as the scale of the parameters.
Scale issues like these can have a big impact on the quality of your model fit, both in terms of the computational efficiency of fitting the model and the quality of the estimates/predictions coming from it. They also affect the interpretability of estimates from the model: it’s extremely difficult for non-technical consumers of model output to interpret regression output when the variation in your features spans different orders of magnitude.
Below are a few tricks that we often use to deal with scale problems.

Understanding why scale causes problems

If our data contain variables with wildly different scales, this can affect computation for many model types (most tree-based methods being exceptions). Take for example the diamonds dataset, which comes packaged with ggplot2. We’ll use the last 10,000 rows, which contain the most costly diamonds and are likely to give us scale issues. Let’s build a very simple mixed model that tries to predict the weight of the diamond given its characteristics. (This is a near-useless thing to model, but it does give us the numerical problems we’re after.)
# Load lme4 and the data
library(lme4)
data("diamonds", package = "ggplot2")

# Grab the priciest diamonds
diamonds_subset <- diamonds[(nrow(diamonds)-10000):nrow(diamonds),]
# Fit the model
fit_1 <- lmer(carat ~ depth + table + price + x + y + z + (1 + price | cut), data = diamonds_subset)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 11.7977 (tol =
## 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
As you can see, it throws all sorts of warnings to tell us that the model has not converged. If this is the case, we should not trust the results from the model fit. If the scale issue is big enough, it will throw an error and not return anything at all. Helpfully, the lme4 package recognizes scale problems and tells us to rescale our values. Let’s try rescaling and see what happens.
# Let's try dividing price by 1000
fit_2 <- lmer(carat ~ depth + table + I(price/1000) + x + y + z + (1 + I(price/1000) | cut), data = diamonds_subset)

# Now let's try taking the natural log
fit_3 <- lmer(carat ~ depth + table + log(price) + x + y + z + (1 + log(price) | cut), data = diamonds_subset)

Hey presto! The model converges. Why does this work?

There are two common operations used to fit models like this (sometimes both at the same time): optimization and matrix inversion.
Numerical optimization works by varying a set of parameters until it finds a minimum of some loss function, such as a (negative) log likelihood or a measure of impurity like entropy. To find a minimum point, most optimization methods evaluate the gradient of the loss function around the current values of the parameters; if the gradient is smaller than a threshold, the optimizer stops. You can think of this as taking a marble, dipping it in molasses, and rolling it into a soup bowl (with no soup); the co-ordinates of the marble when it reaches the bottom are the optimized values of the parameters.
When we have a scaling problem, the soup bowl gets stretched out, making the gradient of the loss function extremely shallow in one dimension but not another. In this situation, a fixed stopping rule will not be appropriate across dimensions. In particular, the gradient along the “stretched” dimension will tend to be shallower than along the un-stretched dimension; we should consequently have a lower threshold for the gradient in the stretched dimension. Some modern optimization algorithms take account of this, but nevertheless you should be aware of the issue.
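To see this concretely, here is a minimal sketch using R's general-purpose optim(); the variable names and numbers are invented for illustration.
# Toy least-squares problem with one predictor on a vastly bigger scale
set.seed(42)
n  <- 1000
x1 <- rnorm(n)                # order of magnitude 1
x2 <- rnorm(n, sd = 1e6)      # order of magnitude 1e6 (think income in dollars)
y  <- 1 + 2 * x1 + 3e-6 * x2 + rnorm(n)

# Least-squares loss as an explicit function of the coefficients
loss <- function(beta) sum((y - beta[1] - beta[2] * x1 - beta[3] * x2)^2)

# On the raw scale the curvature of the loss differs by ~12 orders of magnitude
# across directions, and a general-purpose optimizer tends to struggle
fit_raw <- optim(c(0, 0, 0), loss)
fit_raw$convergence    # 0 means success; anything else signals trouble

# Rescale x2 to roughly unit order of magnitude and try again
loss_scaled <- function(beta) sum((y - beta[1] - beta[2] * x1 - beta[3] * (x2 / 1e6))^2)
fit_scaled <- optim(c(0, 0, 0), loss_scaled)
fit_scaled$convergence
fit_scaled$par         # the slope on x2/1e6 is the raw slope times 1e6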
Scale problems can also make it difficult to invert matrices, as illustrated in the following example.
# a positive definite matrix
a <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
# Easy to invert
solve(a)
##            [,1]       [,2]
## [1,]  1.3333333 -0.6666667
## [2,] -0.6666667  1.3333333
# Make a copy of it and scale the diagonal
b <- a
diag(b) <- diag(b)*1e3
solve(b)
##               [,1]          [,2]
## [1,]  1.000000e-03 -5.000001e-07
## [2,] -5.000001e-07  1.000000e-03
# We only increased the scale by three orders of magnitude, yet the 
# off-diagonals have gone to zero far more quickly
# you can see where this is going

# Let's make a big positive definite matrix
c <- rWishart(1, 12, diag(10))[,,1]
# and scale one of its diagonal elements by 1e15
c[4,4] <- c[4,4]*1e15
try(solve(c), silent = F)
# Ruh Roh
How common is this situation? Take a linear regression where one column of our $X$ matrix is in the billions and everything else is of order of magnitude 1. In the normal equations we use $\hat{\beta} = (X'X)^{-1}X'y$; the diagonal elements of $X'X$ are the sums of squares of the columns of $X$. If one of the columns of $X$ is large in absolute value relative to the others, you will get these sorts of scale issues.
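You can check this directly with kappa(), which estimates the condition number of a matrix. Here is a quick sketch reusing the diamonds_subset data from above:
# Condition number of X'X with price in raw dollars vs. thousands of dollars
X_raw    <- model.matrix(~ depth + table + price + x + y + z, data = diamonds_subset)
X_scaled <- model.matrix(~ depth + table + I(price/1000) + x + y + z, data = diamonds_subset)

kappa(crossprod(X_raw))     # huge: inverting X'X is numerically fragile
kappa(crossprod(X_scaled))  # many orders of magnitude smaller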

But I’m a Bayesian and I’m sampling from the posterior, not optimizing. Does this still matter?

Yes!
Poor scaling can dramatically increase the time spent warming up, and can even hurt the performance of the sampler after warm-up (setting aside possible numerical issues). Hamiltonian Monte Carlo works by sending a particle (our parameter vector) whizzing around the posterior surface. In order to make sure it covers the parameter space efficiently, it requires a well-chosen covariance matrix for the momentum of this particle. The beautiful thing about Stan is that it tunes this covariance matrix during warm-up. But scale issues make it difficult for Stan to find a good covariance estimate. Ensuring that the scales of our unknowns are all of order of magnitude 1 means Stan can estimate a good covariance matrix more quickly.
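In practice this often just means standardizing inside the Stan program. Here is a rough sketch, a toy regression of carat on price with invented priors (not the mixed model from above): the transformed data block does the rescaling, so the sampler only ever sees unit-scale quantities.
data {
  int<lower=1> N;
  vector[N] price;    // raw dollars: order of magnitude 1e3 to 1e4
  vector[N] carat;
}
transformed data {
  // Put the predictor on unit scale before sampling begins
  vector[N] price_std;
  price_std = (price - mean(price)) / sd(price);
}
parameters {
  real alpha;
  real beta;          // effect of a one-standard-deviation change in price
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 1);
  carat ~ normal(alpha + beta * price_std, sigma);
}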

Dealing with scale problems

Here are a few tips to deal with scale issues.

Divide (and conquer!)

An extremely simple technique to deal with scale issues is to simply divide or multiply the variables giving you trouble, as we did in the lmer example above. This has the advantage of being extremely easy to implement, with the one downside that you have to remember how much each variable was scaled by later on in your analysis.
A very common technique is to scale an offending variable into its “z-score” $z = (x - \bar{x})/\sigma_x$, where $\bar{x}$ is the expected value of $x$ and $\sigma_x$ is its standard deviation (we use in-sample estimates for both). This scaling has the advantage of making coefficients interpretable (the response due to a 1-$\sigma$ shift in $x$), while also dealing with the scale issues discussed above.
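Continuing the lmer example from above, a sketch of the z-score approach looks like this; the back-transformation at the end is the bookkeeping step you need to remember.
# Standardize price in-sample (scale() does the same thing)
diamonds_subset$price_z <- (diamonds_subset$price - mean(diamonds_subset$price)) /
  sd(diamonds_subset$price)

fit_z <- lmer(carat ~ depth + table + price_z + x + y + z + (1 + price_z | cut),
              data = diamonds_subset)

# The coefficient on price_z is the expected change in carat for a one-standard-
# deviation increase in price; divide by sd(price) to recover the per-dollar effect
fixef(fit_z)["price_z"] / sd(diamonds_subset$price)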
If this sounds like you’re getting something for nothing, this is probably right. If we’re rescaling our data based on its own values and using informative priors, the priors are implicitly informed by the data. This is not good practice.

Take logs

One useful way of putting variables onto a scale close to whole units is to take their natural log. This can also aid in the interpretability of parameters. Take a time series $y_t$ with an initial value of $y_0$ and a continuous compounding growth rate of $r$. Its value at any time $t$ is

$$y_t = y_0 e^{rt}$$
Taking the natural log of both sides gives

$$\log(y_t) = \log(y_0) + rt$$
Thus if we take the derivative with respect to time (the rate of change of $\log(y_t)$ over time), we get

$$\frac{d \log(y_t)}{dt} = r$$
That is, the slope of $\log(y_t)$ over time is the continuous compounding growth rate of $y_t$. Handy! A similar argument can be used to show that for a log-log regression with a truly exogenous $x$,

$$\log(y_i) = \alpha + \beta \log(x_i) + \epsilon_i,$$
the value of $\beta$ is approximately equal to a constant elasticity: the percentage response in $y$ to a 1% change in $x$. The lesson is that logs are your friend: they help with the technical challenges of scaling problems and also make coefficients more interpretable.
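A quick simulated check of the first result (the numbers are invented): regressing $\log(y_t)$ on $t$ recovers the continuous compounding growth rate.
# Simulate a series growing at 5% (continuously compounded) with a little noise
set.seed(1)
t <- 1:100
y <- 100 * exp(0.05 * t + rnorm(100, sd = 0.02))
coef(lm(log(y) ~ t))["t"]   # the slope is close to 0.05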

Get comfortable with the QR decomposition

Many of you will be familiar with the analytical solution to the linear regression model, the normal equation $\hat{\beta} = (X'X)^{-1}X'y$. But what happens if, as in the example above, $X$ contains a column that is very big, making the inversion of $X'X$ impossible with computer precision? To get around this issue, statistics packages don’t use that particular analytical solution. Instead, they use a method that makes use of the QR decomposition of the $X$ matrix.
The QR decomposition is $X = QR$, where $Q$ is an orthogonal matrix with the same dimensions as $X$, and $R$ is an upper-triangular matrix that contains all the scale and correlation information from $X$. Substituting the decomposition into the normal equation and using the fact that $Q'Q = I$, the system reduces to $R\hat{\beta} = Q'y$, which can be solved by back-substitution for a far larger range of values, and no matrix inversion is required.
Not all of your problems will be able to use the QR decomposition as neatly as linear least squares regression does, but it is a useful tool in your belt.
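Here is a small sketch of the difference on invented data with one column in the billions: the explicit normal-equations solve can choke, while the QR route, which is essentially what R's lm() uses under the hood, is fine.
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n, sd = 1e9))   # one column in the billions
beta_true <- c(1, 2, 3e-9)
y <- X %*% beta_true + rnorm(n)

# Normal equations: X'X has an enormous condition number, so this may error
# out or return imprecise estimates
beta_normal <- try(solve(crossprod(X), crossprod(X, y)))

# QR route: X = QR, so we solve R beta = Q'y by back-substitution instead of
# inverting X'X
beta_qr <- qr.coef(qr(X), y)
beta_qr   # close to beta_true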

An example using scale-free parameters

Suppose that we are interested in estimating the proportion of the population with a rare disease, say $Y$. If we try to model $Y$ directly, we would need a density that is heavily skewed towards 0. As a first approximation, we might think about


This is tempting; however, we will likely run into a number of issues:
  • Computational instability. Poorly scaled parameters can cause numerical problems and divergent transitions. We want to design the model such that the geometry of the posterior is easy for HMC to navigate. As a general rule, we’d like to avoid extremely large or extremely small parameters such as 0.000001 or 100000. If the effect varies on the order of $10^4$ in one direction and $10^{-4}$ in the other, then it will take on the order of $10^8$ steps to explore the entire state space, which may be both intractable and a source of numerical instability. If everything is of order 1, however, then it will take order-1 steps to jump to a new point, allowing Stan to explore the state space much more efficiently.
  • Difficulties in interpreting effects/coefficient values. If we are not sure whether an effect exists, it is much harder to interpret its magnitude on a scale of 0.0001 than on a scale of [0, 1].
So how do we model this?
Method 1: scale the data
A naive approach would be to rescale the data, for example calculating the rate per 10,000 people. However, if we make transformations to the data, it’s important not to implicitly inform the prior with the data (see the caveat on rescaling above).
Method 2: scale the parameters
Another approach is to scale the parameters so that they are “scale-free” or, alternatively, “unit scaled”. This is in the same spirit as scaling covariates that have different units in, say, a regression: if one covariate is house price in dollars and the other is age in years, their scales differ by orders of magnitude purely because of the units they are measured in, which distorts both computation and interpretation.
Once we’ve identified the appropriate units of the problem, we can redefine our system of units so that all of the expected effects are around 1, or between 0 and 1, in those units. Scale-free parameters should typically (although not necessarily) be in the (0, 1) range in absolute value. This suggests that N(0, 1) could be a good candidate for a weakly informative prior.
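As a sketch of what this might look like for the rare-disease example above (the per-10,000 unit and the variable names here are our own choices for illustration), we can sample a unit-scale rate and derive the tiny proportion from it:
data {
  int<lower=0> N;                  // people tested
  int<lower=0> y;                  // cases observed
}
parameters {
  real<lower=0> rate_per_10k;      // unit-scale, "scale-free" parameter
}
transformed parameters {
  real<lower=0, upper=1> theta;    // the actual proportion
  theta = rate_per_10k / 10000;
}
model {
  rate_per_10k ~ normal(0, 1);     // weakly informative on the unit scale
  y ~ binomial(N, theta);
}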
Setting “good priors”
Assume $\theta$ is some parameter in the model. We set a fairly uninformative prior on $\theta$, as we are not sure whether an effect exists. If it does, we will have some idea of its scale. So the prior might take the form


and


If all of the scales are order 1, then all priors will look the same and we have much less bookkeeping to do.
Informative Priors
Now say we have good reason to believe the average effect of $\theta$ is expected to be 4.5. We would be tempted to set a weakly informative prior, say,

$$\theta \sim \text{Normal}(4.5, 1)$$
A better solution would be to transform $\theta$ into a scale-free parameter by scaling it by the value it typically takes, in this case 4.5. So instead of $\theta$, we want to work with $\theta / 4.5$. Incorporating this knowledge into the model turns the weakly informative prior into something much more interpretable.
Alternatively, we can specify $\theta$ as a transformed parameter. In Stan syntax:
parameters {
    real theta_raw;
}
transformed parameters {
    real theta;
    theta = 4.5 + theta_raw;
}
model {
    theta_raw ~ normal(0, 1);
}
Final Checks
Finally, although we may believe our knowledge is accurate, we still need to verify that the fit is good and that the prior mean doesn’t create tension with the rest of the model. For instance, we check that all other parameters depending on this transformation make sense, and that any parameter limits are still valid.