Saturday, January 27, 2018

Some good introductory machine learning resources in R

I didn't want to clog up a Twitter thread with a bunch of machine learning blogs/books/vignettes/software, but also thought an email to Scott wouldn't be useful to anyone else. So here are a few relatively accessible resources that someone with a bit of math should be able to get through with ease.


(Regularized) generalized linear models

This is an excellent worked vignette for regularized (generalized) linear models using the fantastic glmnet package in R:

What you'll find is that for prediction, regularized glms with some feature engineering (interactions, bucketing, splines, combinations of all three) will typically give you similar predictive performance to random forests while maintaining interpretability and the possibility of estimating uncertainty (see below). That's why they're so popular. 
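To make that concrete, here's a minimal sketch of the feature-engineering-plus-cross-validated-regularization workflow glmnet makes easy. The data and variable names are simulated purely for illustration:

library(glmnet)
library(splines)

# Simulated data (purely illustrative)
set.seed(42)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                g = factor(sample(letters[1:4], n, replace = TRUE)))
d$y <- sin(d$x1) + d$x2 * (d$g == "a") + rnorm(n)

# Feature engineering: a natural spline for non-linearity in x1,
# plus an interaction between x2 and the grouping variable
X <- model.matrix(~ ns(x1, df = 5) + x2 * g, data = d)[, -1]

# Cross-validated elastic net (alpha = 1 is the lasso)
fit <- cv.glmnet(X, d$y, alpha = 1)
coef(fit, s = "lambda.1se")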

When you have high-dimensional categorical predictors or natural groupings, it often doesn't make sense to one-hot encode them (i.e. take fixed effects) in a regularized glm. Doing so will result in the same degree of regularization across grouping variables, which might be undesirable. In such a case you can often see huge improvements by simply using varying intercepts (and even varying slopes) in a Bayesian random effects model. The nice thing here is that because it's Bayesian, you get uncertainty for free. Well, not free--you pay for it in the extra coal and time you'll burn fitting your model. But they're really pretty great. rstanarm implements these very nicely.
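As a sketch of what that looks like in rstanarm (again on simulated data, with made-up variable names), a varying-intercept, varying-slope model is one line of formula syntax:

library(rstanarm)

# Simulated grouped data (purely illustrative)
set.seed(42)
n <- 1000
g <- factor(sample(letters[1:10], n, replace = TRUE))
x <- rnorm(n)
a <- rnorm(10)            # group-specific intercepts
b <- rnorm(10, 1, 0.3)    # group-specific slopes
y <- a[as.integer(g)] + b[as.integer(g)] * x + rnorm(n)
d <- data.frame(y, x, g)

# Varying intercepts and slopes by group, fit with MCMC
fit <- stan_glmer(y ~ x + (1 + x | g), data = d, chains = 4, iter = 2000)

# The uncertainty you "paid" for: posterior intervals for all parameters
posterior_interval(fit, prob = 0.9)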

In the above two methods, if you want to capture non-linearities, you have to cook up your own non-linear features. But there are methods that discover them quite well by themselves, while retaining the interpretability of linear models. The fantastic mgcv and rstanarm packages will fit Generalised Additive Models using maximum likelihood and MCMC-based techniques respectively.

https://github.com/noamross/2017-11-14-noamross-gams-nyhackr/blob/master/2017-11-14-noamross-gams-nyhackr.pdf is a fun introduction

and 

https://m-clark.github.io/docs/GAM.html

is a full vignette on implementation using various GAM packages. 
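For a flavour of how little code a GAM takes in mgcv, here is a minimal sketch on simulated data (names invented for illustration):

library(mgcv)

# Simulated data with smooth non-linearities (purely illustrative)
set.seed(42)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2^2 + rnorm(n, sd = 0.3)

# s() terms let the model discover the non-linearities for you
fit <- gam(y ~ s(x1) + s(x2), data = d, method = "REML")

summary(fit)          # approximate significance of the smooth terms
plot(fit, pages = 1)  # the estimated smooths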

Tree-based methods

The obvious alternatives to regularized glms are tree-based methods and neural networks. A lot of industry folks, especially those who started life using proprietary packages, use SVMs too. Pedants who enjoy O(n^3) operations seem to get a weird kick out of Gaussian Processes. The point of all these methods is the same: to relax the linearity assumption--or really, to automatically discover non-linear relationships between features and outcomes. Tree-based methods and neural networks will also do well at discovering interactions. Neural networks go a step further and uncover representations of your data which might be useful in themselves.

To get a good understanding of tree-based methods, it makes sense to start at the beginning--with a simple classification and regression tree. I found this introduction pretty clear:

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
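And a minimal taste of CART using rpart, fitting the kyphosis data that ships with the package (the vignette itself is still worth reading for the theory):

library(rpart)

# Classification tree on the kyphosis data bundled with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

printcp(fit)           # cross-validated complexity-parameter table for pruning
plot(fit); text(fit)   # draw the fitted tree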

Once you understand CART, random forests are probably the next step. The original Breiman piece is as good a place to start as any:

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
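As a minimal sketch of fitting one in R (the randomForest package implements Breiman's algorithm; the built-in iris data is used purely for illustration):

library(randomForest)

# Random forest on a built-in dataset; OOB error comes along for free
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

print(fit)       # confusion matrix and out-of-bag error estimate
importance(fit)  # the variable-importance measures Breiman describes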

Next you should learn about tree-based additive models. These come in many varieties, but something close to the current state-of-the-art is implemented using xgboost. These techniques combined with smart feature engineering will work extremely well for a wide range of predictive problems. I incorporate them into my work to serve as a baseline that simpler models (for which we can get more sensible notions of uncertainty) should be able to get close to with enough work.

https://xgboost.readthedocs.io/en/latest/model.html
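Here is a minimal xgboost sketch, using the mushroom data that ships with the package (the parameters are illustrative defaults, not tuned):

library(xgboost)

# Gradient-boosted trees on the agaricus (mushroom) data bundled with xgboost
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

fit <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.3),
  data = dtrain,
  nrounds = 50
)

xgb.importance(model = fit)  # which features the boosted trees lean on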


Net-based methods

Neural networks are of course all the rage, yet it's helpful to remember that they're really just tools for high-dimensional function approximation. I found them hard to get into coming from an econometrics background (where notions like "maybe we should have more observations than unknowns in the model" are fairly common). But there are really just a few concepts to understand in order to get something working.

I found David MacKay's chapters on them to be extremely easy to grasp. His whole, brilliant book is available for free here, with the relevant chapters starting at page 467:

http://www.inference.org.uk/itprnn/book.pdf

Now that you have some understanding of what a neural network is and how it's fit, you can get down to fitting some. There are a few great high-level approaches, like Keras and H2O.ai, which are extremely easy to dive in with:

https://keras.rstudio.com

and

http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/Ruser/Rinstall.html
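For a sense of the Keras workflow from R, here is a hedged sketch of a tiny fully-connected network--the architecture and the simulated data are invented for illustration:

library(keras)

# Simulated data: 10 numeric features, a binary outcome (purely illustrative)
set.seed(42)
x <- matrix(rnorm(1000 * 10), ncol = 10)
y <- rbinom(1000, 1, plogis(x[, 1] - x[, 2]))

# A small fully-connected network for binary classification
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = 10) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy",
                  metrics = "accuracy")

model %>% fit(x, y, epochs = 10, batch_size = 32, validation_split = 0.2)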

Note that these two approaches are great for fairly simple prediction tasks. If you want to make any real investment in deep learning for image/voice/NLP, then you will find yourself working at a lower level (the analogy for statisticians would be going from rstanarm/brms to Stan proper), like Torch or TensorFlow. At this point you would probably be wise to ask yourself what you're doing in R--almost the entire AI community uses Python.

Even so, there is a reasonable API for TensorFlow available within R. I've not done a huge amount of playing outside of the tutorials, which seem well written.

https://tensorflow.rstudio.com/tensorflow/
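If you want a quick check that the plumbing works, a "hello world" in the TF 1.x graph-and-session style (current when this was written; newer TensorFlow versions are eager by default and drop Session) looks like this from R:

library(tensorflow)

# TF 1.x-style graph-and-session "hello world"
sess <- tf$Session()
a <- tf$constant(2)
b <- tf$constant(3)
sess$run(a + b)  # evaluates the graph and returns 5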

Others? 

If you know of any other great resources for someone--especially an economist--wanting to build their machine-learning chops, please drop them in the comments! 

Comments:

  1. Alex S writes:

    That's supervised learning, not ML. So you need things like

    https://sites.google.com/site/igorcarron2/matrixfactorizations

    and maybe

    https://arxiv.org/abs/1801.01586

    For a smoother transition

    http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf

    may help. And then I'd strongly recommend

    http://castlelab.princeton.edu/html/Papers/Powell-UnifiedFrameworkStochasticOptimization_July222017.pdf
