Saturday, January 27, 2018

Some good introductory machine learning resources in R

I didn't want to clog up a Twitter thread with a bunch of machine learning blogs/books/vignettes/software, but also thought an email to Scott wouldn't be useful to anyone else. So here are a few relatively-accessible resources that someone with a bit of math should be able to get through with ease. 


(Regularized) generalized linear models

This is an excellent worked vignette for regularized (generalized) linear models using the fantastic glmnet package in R: 

What you'll find is that for prediction, regularized glms with some feature engineering (interactions, bucketing, splines, combinations of all three) will typically give you similar predictive performance to random forests while maintaining interpretability and the possibility of estimating uncertainty (see below). That's why they're so popular. 
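As a rough illustration of the idea, here is a minimal sketch of a lasso-regularized linear model with all pairwise interactions as engineered features. The dataset (the built-in mtcars), the choice of alpha = 1, the five folds and the "lambda.1se" rule are my assumptions for the example, not recommendations from the vignette.

```r
library(glmnet)

set.seed(1)

# glmnet wants a numeric matrix; model.matrix expands main effects plus all
# pairwise interactions (.^2). Drop the intercept column, glmnet adds its own.
x <- model.matrix(mpg ~ .^2, data = mtcars)[, -1]
y <- mtcars$mpg

# Cross-validate over the regularization path (alpha = 1 is the lasso penalty)
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5)

# Coefficients at the more conservative "one standard error" lambda;
# most of them are shrunk exactly to zero, which is what keeps the model readable
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))
nonzero <- rownames(coefs)[coefs[, 1] != 0]
print(nonzero)

# In-sample predictions at the same lambda
pred <- predict(cv_fit, newx = x, s = "lambda.1se")
```

Swapping alpha toward 0 moves the penalty from lasso to ridge, and spline or bucketed features can be added to the formula in the same way as the interactions.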

When you have high-dimensional categorical predictors or natural groupings, it often doesn't make sense to one-hot encode them (i.e. take fixed effects) in a regularized glm. Doing so will result in the same degree of regularization across grouping variables, which might be undesirable. In such a case you can often see huge improvements by simply using varying intercepts (and even varying slopes) in a Bayesian random effects model. The nice thing here is that because it's Bayesian, you get uncertainty for free. Well, not free--you pay for it in the extra coal and time you'll burn fitting your model. But they're really pretty great. rstanarm implements these very nicely. 
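A minimal sketch of a varying-intercepts model in rstanarm might look like the following. The sleepstudy data (shipped with lme4, which rstanarm depends on), the short chains and the 90% interval are assumptions made to keep the example quick; for real work you would run more iterations and check the diagnostics.

```r
library(rstanarm)

# Reaction times for 18 subjects over 10 days of sleep deprivation
data("sleepstudy", package = "lme4")

# (1 | Subject) gives each subject its own intercept, partially pooled
# toward the population mean. (1 + Days | Subject) would add varying slopes.
fit <- stan_lmer(
  Reaction ~ Days + (1 | Subject),
  data = sleepstudy,
  chains = 2, iter = 500, seed = 1, refresh = 0
)

# Posterior uncertainty interval for the effect of Days
ci <- posterior_interval(fit, prob = 0.9, pars = "Days")
print(ci)
```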

In the above two methods, if you want to discover non-linearities by yourself, you have to cook your own non-linear features. But there are methods that do this quite well, while retaining the interpretability of linear models. The fantastic mgcv and rstanarm packages will fit Generalised Additive Models using maximum likelihood and MCMC-based techniques respectively. 

https://github.com/noamross/2017-11-14-noamross-gams-nyhackr/blob/master/2017-11-14-noamross-gams-nyhackr.pdf is a fun introduction

and 

https://m-clark.github.io/docs/GAM.html

is a full vignette on implementation using various GAM packages. 
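For a feel of what a GAM does, here is a small sketch with mgcv on simulated data. The data-generating function, sample size and the choice of method = "REML" are assumptions for the example. The point is that s(x) lets the model find the curve without any hand-built non-linear features.

```r
library(mgcv)

set.seed(1)

# Simulated data with a clearly non-linear relationship
n <- 300
d <- data.frame(x = runif(n, 0, 10))
d$y <- sin(d$x) + 0.1 * d$x + rnorm(n, sd = 0.3)

# s(x) is a penalized smooth; the wiggliness is chosen by REML
fit <- gam(y ~ s(x), data = d, method = "REML")

# Effective degrees of freedom: 1 would mean a straight line
edf <- summary(fit)$edf
print(edf)

# Predictions with standard errors on a grid
grid <- data.frame(x = seq(0, 10, length.out = 50))
pr <- predict(fit, newdata = grid, se.fit = TRUE)
```

Calling plot(fit) afterwards draws the estimated smooth with its confidence band, which is where most of the interpretability comes from.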

Tree-based methods

The obvious alternatives to regularized glms are tree-based methods and neural networks. A lot of industry folks, especially those who started life using proprietary packages, use SVMs too. Pedants who enjoy O(n^3) operations seem to get a weird kick out of Gaussian Processes. The point of all these methods is the same: to relax (or really, to automatically discover) non-linear relationships between features and outcomes. Tree-based methods and neural networks will also do well at discovering interactions. Neural networks go a step further and uncover representations of your data which might be useful in themselves. 

To get a good understanding of tree-based methods, it makes sense to start at the beginning--with a simple classification and regression tree. I found this introduction pretty clear: 

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
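A single classification tree takes only a couple of lines with rpart. This is a sketch on the built-in iris data, which is my choice for illustration; the accuracy here is in-sample, so it flatters the model.

```r
library(rpart)

# Fit a classification tree predicting species from the four measurements
fit <- rpart(Species ~ ., data = iris, method = "class")

# Printing the fit shows the splits as nested if/else rules
print(fit)

# In-sample class predictions and accuracy
pred <- predict(fit, newdata = iris, type = "class")
acc <- mean(pred == iris$Species)
print(acc)
```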

Once you understand CART, then Random Forests are probably the next step. The original Breiman piece is as good a place to start as any: 

https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
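A random forest averages many such trees, each grown on a bootstrap sample with a random subset of features tried at each split. Here is a minimal sketch using the randomForest package; the package choice, the iris data and ntree = 500 are assumptions for the example.

```r
library(randomForest)

set.seed(1)

# Grow 500 trees; importance = TRUE also records variable importance
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

# Out-of-bag error after all trees: an honest error estimate that comes
# from the observations each tree did not see during fitting
oob_error <- fit$err.rate[fit$ntree, "OOB"]
print(oob_error)

# Which features the forest leaned on
imp <- importance(fit)
print(imp)
```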

Next you should learn about tree-based additive models. These come in many varieties, but something close to the current state-of-the-art is implemented using xgboost. These techniques combined with smart feature engineering will work extremely well for a wide range of predictive problems. I incorporate them into my work to serve as a baseline that simpler models (for which we can get more sensible notions of uncertainty) should be able to get close to with enough work.

https://xgboost.readthedocs.io/en/latest/model.html
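As a sketch of what fitting a boosted tree model looks like in R, the following uses xgb.DMatrix and xgb.train on the built-in mtcars data. The data, the shallow depth, the learning rate and the number of rounds are all assumptions for the example, and the xgboost R interface has changed across versions, so check the documentation for the version you have installed. The error reported is in-sample.

```r
library(xgboost)

set.seed(1)

# xgboost works on numeric matrices; predict mpg from the other columns
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
dtrain <- xgb.DMatrix(data = x, label = y)

# Each round adds one shallow tree fit to the current errors,
# scaled down by the learning rate eta
params <- list(objective = "reg:squarederror", max_depth = 2, eta = 0.1)
bst <- xgb.train(params = params, data = dtrain, nrounds = 100)

# In-sample predictions and root mean squared error
pred <- predict(bst, dtrain)
rmse <- sqrt(mean((pred - y)^2))
print(rmse)
```

In practice you would pick nrounds, max_depth and eta by cross-validation (xgb.cv) rather than fixing them like this.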


Net-based methods

Neural networks are of course all the rage, yet it's helpful to remember that they're really just tools for high-dimensional function approximation. I found them hard to get into coming from an econometrics background (where notions like "maybe we should have more observations than unknowns in the model" are fairly common). But there are really just a few concepts to understand in order to get something working. 

I found David MacKay's chapters on them to be extremely easy to grasp. His whole, brilliant book is available for free here, with the relevant chapters starting at page 467: 

http://www.inference.org.uk/itprnn/book.pdf
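To make the "few concepts" concrete, here is a sketch of a one-hidden-layer network fit by plain gradient descent in base R. It is my own toy example rather than anything taken from the book: the target function, the 16 hidden units, the learning rate and the number of steps are all assumptions. The concepts on show are a forward pass, a loss, and gradients passed backwards through the layers.

```r
set.seed(1)

# Toy regression problem: learn sin(x) from noisy samples
n <- 200
x <- matrix(runif(n, -3, 3), ncol = 1)
y <- sin(x) + rnorm(n, sd = 0.1)

# Parameters: one hidden layer with h tanh units, linear output
h  <- 16
W1 <- matrix(rnorm(h, sd = 1), nrow = 1)    # 1 x h input-to-hidden weights
b1 <- rep(0, h)                             # hidden biases
W2 <- matrix(rnorm(h, sd = 0.1), ncol = 1)  # h x 1 hidden-to-output weights
b2 <- 0                                     # output bias
lr <- 0.05                                  # learning rate

# Forward pass: hidden activations and predictions for inputs x
forward <- function(x) {
  H <- tanh(sweep(x %*% W1, 2, b1, "+"))
  list(H = H, yhat = H %*% W2 + b2)
}

loss_start <- mean((forward(x)$yhat - y)^2)

for (step in 1:2000) {
  f <- forward(x)

  # Backward pass for the loss 0.5 * mean((yhat - y)^2)
  g_out <- (f$yhat - y) / n               # gradient w.r.t. predictions
  gW2 <- t(f$H) %*% g_out                 # h x 1
  gb2 <- sum(g_out)
  gH  <- (g_out %*% t(W2)) * (1 - f$H^2)  # back through tanh, n x h
  gW1 <- t(x) %*% gH                      # 1 x h
  gb1 <- colSums(gH)

  # Gradient descent update
  W1 <- W1 - lr * gW1
  b1 <- b1 - lr * gb1
  W2 <- W2 - lr * gW2
  b2 <- b2 - lr * gb2
}

loss_end <- mean((forward(x)$yhat - y)^2)
print(c(start = loss_start, end = loss_end))
```

The high-level libraries do the same thing with automatic differentiation, better optimizers and mini-batches, so you never write the backward pass by hand.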

Given you have some understanding now of what a neural network is and how they're fit, you can get down to fitting some. There are a few great high-level approaches, like Keras and H2O.ai, which are extremely easy to dive in with:

https://keras.rstudio.com

and

http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/Ruser/Rinstall.html

Note that these two approaches are great for fairly simple prediction tasks. If you want to make any real investment in deep learning for image/voice/NLP then you will find yourself working at a lower level (the analogy for statisticians would be going from rstanarm/brms to Stan proper), like Torch or TensorFlow. At this point you would probably be wise to ask yourself what you're doing in R--almost the entire AI community uses Python.

Even so, there is a reasonable API for TensorFlow available within R. I've not done a huge amount of playing outside of the tutorials, which seem well written.

https://tensorflow.rstudio.com/tensorflow/

Others? 

If you know of any other great resources for someone--especially an economist--wanting to build their machine-learning chops, please drop them in the comments! 

67 comments:

  1. Alex S writes:

    That's supervised learning, not ML. So you need things like

    https://sites.google.com/site/igorcarron2/matrixfactorizations

    and maybe

    https://arxiv.org/abs/1801.01586

    For smoother transition

    http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf

    may help. And then I'd strongly recommend

    http://castlelab.princeton.edu/html/Papers/Powell-UnifiedFrameworkStochasticOptimization_July222017.pdf
