Modern Statistical Workflow: Some good introductory machine learning resources in R

Saturday, January 27, 2018

Some good introductory machine learning resources in R

I didn't want to clog up a Twitter thread with a bunch of machine learning blogs/books/vignettes/software, but also thought an email to Scott wouldn't be useful to anyone else. So here are a few relatively-accessible resources that someone with a bit of math should be able to get through with ease.

(Regularized) generalized linear models

This is an excellent worked vignette for regularize (generalized) linear models using the fantastic glmnet package in R:

https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html

What you'll find is that for prediction, regularized glms with some feature engineering (interactions, bucketing, splines, combinations of all three) will typically give you similar predictive performance to random forests while maintaining interpretability and the possibility of estimating uncertainty (see below). That's why they're so popular.

When you have high-dimensioned categorical predictors or natural groupings, it often doesn't make sense to one-hot encode them (ie. take fixed effects) in a regularized glm. Doing so will result in the same degree of regularization across grouping variables, which might be undesirable. In such a case you can often see huge improvements by simply using varying intercepts (and even varying slopes) in a Bayesian random effects model. The nice thing here is that because it's Bayesian, you get uncertainty for free. Well not free--you pay for it in the extra coal and time you'll burn fitting your model. But they're really pretty great. rstanarm implements these very nicely.

https://cran.r-project.org/web/packages/rstanarm/vignettes/glmer.html

In the above two methods, if you want to discover non-linearities by yourself, you have to cook your own non-linear features. But there are methods that do this quite well, while retaining the interpretability of linear models. The fantastic mgcv and rstanarm packages will fit Generalised Additive Models using maximum likelihood and MCMC-based techniques respectively.

https://github.com/noamross/2017-11-14-noamross-gams-nyhackr/blob/master/2017-11-14-noamross-gams-nyhackr.pdf is a fun introduction

and

https://m-clark.github.io/docs/GAM.html

is a full vignette on implementation using various GAM packages.

Tree-based methods

The obvious alternatives to regularized glms are tree-based methods and neural networks. A lot of industry folks, especially those who started life using proprietary packages, use SVMs too. Pedants who enjoy O(n^3) operations seem to get a weird kick out of Gaussian Processes. The point of all these methods is the same: to relax (or really, to automatically discover) non-linear relationships between features and outcomes. Tree-based methods and neural networks will also do well at discovering interactions too. Neural networks go a step further and uncover representations of your data which might be useful in themselves.

To get a good understanding of tree-based methods, it makes sense to start at the beginning--with a simple classification and regression tree. I found this introducton pretty clear:

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

Once you understand CART, then Random Forests are probably the next step. The original Breiman piece is as good a place to start as any:

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf

Next you should learn about tree-based additive models. These come in many varieties, but something close to the current state-of-the-art is implemented using xgboost. These techniques combined with smart feature engineering will work extremely well for a wide range of predictive problems. I incorporate them into my work to serve as a baseline that simpler models (for which we can get more sensible notions of uncertainty) should be able to get close to with enough work.

https://xgboost.readthedocs.io/en/latest/model.html

Net-based methods

Neural networks are of course all the rage, yet it's helpful to remember that they're really just tools for high-dimensional functional approximation. I found them hard to get into coming from an econometrics background (where notions like "maybe we should have more observations than unknowns in the model" are fairly common). But there are really just a few concepts to understand in order to get something working.

I found David Mackay's chapters on them to be extremely easy to grasp. His whole, brilliant book is available for free here, with the relevant chapters starting at page 467:

http://www.inference.org.uk/itprnn/book.pdf

Given you have some understanding now of what a neural network is and how they're fit, you can get down to fitting some. There are a few great high-level approaches, like Keras and H2O.ai, which are extremely easy to dive in with:

https://keras.rstudio.com

and

http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/Ruser/Rinstall.html

Note that these two approaches are great for fairly simple prediction tasks. If you want to make any real investment in deep learning for image/voice/NLP then you will find yourself working at a lower level (the analogy for statisticians would be going from rstanarm/brms to Stan proper), like Torch or TensorFlow. At this point you would probably be wise in asking yourself what you're doing in R--almost the entire AI community uses Python.

Even so, there is a reasonable API for TensorFlow available within R. I've not done a huge amount of playing outside of the tutorials, which seem well written.

https://tensorflow.rstudio.com/tensorflow/

Others?

If you know of any other great resources for someone--especially an economist--wanting to build their machine-learning chops, please drop them in the comments!