Today’s booming data science landscape presents countless opportunities to learn and implement machine learning algorithms. Newbies to the fascinating world of machine learning will find at their disposal a mind-boggling number of learning resources, tutorials, open source tools and public datasets.
Can it get overwhelming? Yes.
If you’ve been dipping your toes in the sea of machine learning algorithms and if you’re now ready to dive right into it, there’s only one way to do it: implement! This article provides a 5-step process to help you structure your approach to your next machine learning use case.
Step 1: Pick your algorithm
Start by selecting an algorithm that you want to learn and implement. If supervised learning interests you, there is a host of regression and classification algorithms to choose from. If you want to get started with unsupervised learning, you can experiment with clustering or dimensionality reduction techniques. If you aren’t sure where to begin, here’s a handy guide to get you started thinking about various algorithms.
Spend some time understanding the algorithm at a high level. Look for video tutorials, summaries and blogs by thought leaders. Tap into the large online machine learning community through discussions on the ELI subreddit or other data science threads on platforms like Quora and Cross Validated. Ask yourself some key questions to build your intuition around the model: In what context was the algorithm developed? Is it popular in certain industries, or for some specific tasks? What kind of problems is it capable of solving?
By the end of this step, you should be able to explain what the algorithm can do, as well as its limitations.
Step 2: Gather your toolkit: Problem, data and language
Now that you know what the algorithm can and can’t do, pick a problem statement that you can try to solve with it, as well as a dataset that fits the problem. Try to reconcile the algorithm with the data by answering some key questions. Does the algorithm need labelled data? If the data is labelled, is it imbalanced? How many features can the algorithm work with?
Decide on the programming language you’d like to use for the implementation. The most recent KDnuggets poll places Python as the most popular choice, with other options like R and Java not too far behind. If you’re new to programming, Python is a great choice since it is built for readability and ease of use and has plenty of libraries for data science. For some inspiration to get you started, head over to Project Jupyter’s collection of Jupyter/IPython notebooks on GitHub, which covers a wide variety of machine learning and statistical content.
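The data questions in this step (is the data labelled, is it imbalanced, how many features are there) can be answered with a few lines of code before any modelling starts. The following is a minimal sketch in plain Python; the function name summarize_dataset, the list-of-lists data format and the toy values are all assumptions made for illustration, not part of any particular library.

```python
from collections import Counter


def summarize_dataset(rows, labels=None):
    """Report basic facts that help match a dataset to an algorithm.

    rows   : list of feature lists (hypothetical toy format)
    labels : list of class labels, or None if the data is unlabelled
    """
    summary = {
        "n_samples": len(rows),
        "n_features": len(rows[0]) if rows else 0,
        "labelled": labels is not None,
    }
    if labels:
        counts = Counter(labels)
        summary["class_counts"] = dict(counts)
        # Largest class size divided by smallest; 1.0 means perfectly balanced.
        summary["imbalance_ratio"] = max(counts.values()) / min(counts.values())
    return summary


# Made-up toy data, for illustration only.
toy_rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.0], [6.7, 3.1]]
toy_labels = ["a", "a", "a", "a", "b"]

toy_summary = summarize_dataset(toy_rows, toy_labels)
print(toy_summary)
```

A high imbalance ratio or a feature count far beyond what the algorithm handles well is a signal to revisit either the dataset or the choice of algorithm before moving on.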
Step 3: Understand your algorithm
Armed with your data and a clear end goal, you should now start researching the algorithm in depth. At this stage, you can start reading influential research papers on the algorithm, as they will give you a baseline understanding upon which you can build deeper knowledge. You can supplement this with useful resources on sites like arxiv-sanity, where you can find thumbnail previews and abstracts. This may seem challenging at first, but you’ll soon develop the ability to grasp the key aspects of papers. You can deepen your mastery of the algorithm over several iterations, skimming the headers and tables first and progressing to the text and math in subsequent passes. For most pioneering papers, you’ll find helpful summary articles and explanatory blogs. Look for these to speed up your learning process.
An excellent way to wrap your head around algorithms is to go through GitHub repositories and see how other data scientists have approached a problem. This gives you perspective on how an algorithm can be adapted to different contexts and is also a great way to engage with the wider data science community.
Step 4: Code
With your recipe and all the ingredients in place, you can now confidently start coding up your algorithm. Most programming languages provide libraries for modelling tasks, so you’ll rarely find yourself needing to code the details of the algorithm from scratch. You will, however, need to make several decisions during implementation with regard to architecture, hyperparameters, feature engineering and more. This is where you can draw on your understanding of the algorithm to tackle these choices. At the same time, you can unleash your creativity by experimenting with the features in your data, finding useful interactions and drawing upon any a priori knowledge you have of the problem or domain.
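To make the idea of an implementation decision concrete, here is a small sketch of choosing one hyperparameter, the number of neighbours k in a k-nearest-neighbours classifier, by comparing accuracy on a held-out validation set. It is written from scratch in plain Python only to stay self-contained; in practice you would more likely reach for a library. The helper names, the candidate values of k and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Pair each training point's Euclidean distance with its label, nearest first.
    distances = sorted(
        (math.dist(x, point), y) for x, y in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def validation_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation points classified correctly for a given k."""
    correct = sum(
        knn_predict(train_x, train_y, point, k) == label
        for point, label in zip(val_x, val_y)
    )
    return correct / len(val_y)


# Made-up toy data: two loose clusters.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]]
train_y = ["low", "low", "low", "high", "high", "high"]
val_x = [[1.1, 1.0], [3.1, 3.0]]
val_y = ["low", "high"]

# Try a few candidate values of k and keep the one with the best validation score.
scores = {k: validation_accuracy(train_x, train_y, val_x, val_y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same pattern (define candidates, score each on data the model has not trained on, keep the best) carries over to other hyperparameters and to feature engineering choices.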
While building your algorithm workflow, make sure your code is well-commented so that other data scientists can collaborate with you if you eventually open source your project. This will also help you document your flow of thought and make debugging a lot easier.
By the end of this step, you’ll have supplemented your (relatively) passive learnings from papers, code documentation and online tutorials with your active learnings from actually coding and running the algorithm yourself.
Step 5: Iterate
This is the final and often the most satisfying stage of implementing your algorithm. Evaluate how well your algorithm performs on criteria such as speed, accuracy, complexity and interpretability. Go back to your code and tweak it to move closer to your end goal. If you really want to scale your learning, try replicating your work in another programming language or on a similar dataset. This will help you get an idea of how generalizable your algorithm is, and you’ll internalize your learnings very effectively. Fine-tuning your project in an iterative manner provides a deeper grasp of the subtleties of the algorithm.
A bonus step in your learning process would be the addition of a feedback loop from peers, if you can get it. An outside-in perspective on your work is always a great way to spot any potential gaps that you may have missed. And positive affirmation of your work from a fellow data scientist can’t hurt!
Implementing a machine learning algorithm is a self-reinforcing process. It’s the best way to really learn the ins and outs of any technique. Over time, you’ll identify and excel in your areas of expertise, and you will also build a pipeline that you can reuse for future projects.
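One way to keep the iteration loop honest is to measure the same criteria after every tweak. The sketch below records accuracy and wall-clock time for two settings of a deliberately simple model. The threshold_model function is a hypothetical stand-in for whatever you actually trained, and the cutoff values and toy data are assumptions made for illustration.

```python
import time


def evaluate(predict, inputs, expected):
    """Return accuracy and wall-clock prediction time for a prediction function."""
    start = time.perf_counter()
    predictions = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    correct = sum(p == e for p, e in zip(predictions, expected))
    return {"accuracy": correct / len(expected), "seconds": elapsed}


def threshold_model(x, cutoff=2.0):
    """Hypothetical stand-in for a trained model: a one-feature threshold rule."""
    return "high" if x[0] > cutoff else "low"


# Made-up held-out data, for illustration only.
test_inputs = [[1.0], [1.5], [2.5], [3.0]]
test_labels = ["low", "high", "high", "high"]

# Evaluate each candidate setting with the same data and the same metrics.
results = {
    cutoff: evaluate(lambda x, c=cutoff: threshold_model(x, c), test_inputs, test_labels)
    for cutoff in (1.2, 2.0)
}
for cutoff, metrics in results.items():
    print(cutoff, metrics)
```

Logging the results of each iteration side by side makes it easier to see whether a tweak genuinely helped or simply traded accuracy for speed.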