This is a straightforward guide with a generic, efficient method for tackling machine learning projects. Every problem has particularities and subtleties of its own, of course, but here is a general-purpose framework covering the most useful elements of the ML pipeline.
Machine Learning is certainly the trendiest topic of the last few years. We are all flooded with a constant stream of pitches from LinkedIn recruiters (please check out this awesome Twitter account), consultancies, software vendors, law professionals and self-declared experts of all kinds.
Of course, ML articles must feature inspirational/WTF quotes, robot pictures, Matrix-style images, and tons of #HASHTAGS, so here are some fresh ones (thank you, LinkedIn!):
Now that I have your attention, let's get to it! :)
With all the hype and fuss around the topic, it can be hard to know how to start a project and get it right.
Which tool should I use?
R or Python? The answer is:
How do I get the job done in an efficient way?
All examples will be in R, but they are easily transposable to Python. I'll be more than happy to answer any questions in the comments. Also, Qwant is your friend ;-)
1. Start with descriptive statistics & graphics:
Go univariate, bivariate, and multivariate (e.g. PCA). Do some dataviz and try answering these questions about your data:
- What is your data?
- What does it look like? Does it have a special shape?
- Are there any obvious structures or relationships?
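The exploration above can be sketched in a few lines. This is a minimal illustration in Python with scikit-learn (the post's examples are in R, but it points Pythonistas to sklearn); the iris dataset is just a stand-in for your own data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Univariate: basic summaries per feature
means = X.mean(axis=0)
stds = X.std(axis=0)

# Bivariate: the correlation matrix reveals obvious relationships
corr = np.corrcoef(X, rowvar=False)

# Multivariate: PCA (on standardized data) to look for structure or shape
pca = PCA(n_components=2)
X2 = pca.fit_transform(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)
```

Plotting `X2` colored by `y` is often enough to spot clusters or outliers before any modeling.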
2. Do a fast check for leakage
Data leaks are still poorly understood even though they are fairly common. They can cause puzzling model behaviour, such as good performance during development and poor accuracy in production. Leakage happens when a model is built using data that won't be available when the model is put to use in the real world. There is not much clear literature on the subject, but here's a nice resource on the topic.
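One common, easy-to-check form of leakage is preprocessing fit on the full dataset before splitting. Here's a minimal sketch in Python with scikit-learn (dataset and model are illustrative): keeping every transformation inside a pipeline guarantees it is fit on training folds only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# LEAKY: scaling the whole dataset first lets test-fold statistics
# bleed into training:
#   X_scaled = StandardScaler().fit_transform(X)  # don't do this

# SAFE: the pipeline refits the scaler on each training fold only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

The same discipline applies to imputation, encoding, and feature selection: anything learned from data belongs inside the cross-validated pipeline.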
3. Keep the right features only
Selecting the right features will make the difference between passable performance with long training times and great accuracy with short training times. The steps are the following:
- Remove redundant features. Some features don't offer any new information. Thou shalt delete them.
- Rank features by importance (using a Random Forest, for example) to understand which variables are most linked to the one you are trying to predict.
- Finally, use a feature selection procedure to eliminate useless variables. Parsimony is awesome :) it keeps your model simple, helps avoid overfitting, and keeps the model generalizable.
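The ranking and selection steps above can be sketched as follows, again in Python with scikit-learn (illustrative dataset; the `threshold="median"` cutoff is an arbitrary choice for the sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Rank features by importance with a Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first

# Keep only features with above-median importance (parsimony)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="median",
).fit(X, y)
X_small = selector.transform(X)
print(X.shape[1], "->", X_small.shape[1])
```

Remember the leakage warning: in a real project, run the selection inside the cross-validation loop rather than on the full dataset as done here for brevity.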
4. Choose your model carefully:
Test a few of them. Every model you run will tell you a different story. Stop and listen to it. They are all interesting.
Look at the coefficients. Look at the metrics. Did they change? How much?
When you pause to do this, you can make better decisions on the model to run next.
Always check if your model can be generalized. Then, check again. Obviously, you must always test on a separate data sample.
Get yourself the right toolbox: caret in R and scikit-learn for Pythonistas are essentials that will get you there quite fast!
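Testing a few models and always scoring on a separate sample might look like this minimal sketch in Python with scikit-learn (the two candidate models are arbitrary examples):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "logreg": make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=5000)),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Compare candidates by cross-validation on the training data only
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(name, round(results[name], 3))

# The held-out test set is touched exactly once, at the very end
best = max(results, key=results.get)
test_acc = models[best].fit(X_train, y_train).score(X_test, y_test)
print("test accuracy:", round(test_acc, 3))
```

Each model's coefficients or importances tell a different story about the data; compare them, not just the scores.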
The research question is central, keep it in mind. Especially when you have a large and rich data set, it’s very easy to get lost or distracted. There are so many interesting relationships you can find. Months later, you’ve tested every possible predictor but you’re not making any real progress.
Keep the focus on your destination: the research question. Write it on a post-it and stick it on your desk or your screen!
You’re all set now. Please leave a comment and give ❤️