4) Popular ML algorithms
Now, we’re going to review the most popular ML algorithms. For each algorithm, we’ll talk about its advantages and drawbacks.
Linear Regression is a regression algorithm (but you probably figured that out 🙂 ). This algorithm’s principle is to find a linear relation within your data (Figure 10). Once the linear relation is found, new values are predicted using this relation.
Main advantages:
• very simple algorithm
• doesn’t take a lot of memory
• quite fast
• easy to explain
Main drawbacks:
• requires the data to be linearly distributed (see « Polynomial Regression » if you think you need a polynomial fit)
• is unstable when features are redundant, i.e. when there is multicollinearity (note that, in that case, you should have a look at « Elastic-Net » or « Ridge Regression »).
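To make the principle concrete, here is a minimal pure-Python sketch of fitting y = a·x + b by least squares (the toy data points and variable names are hypothetical, not from the article):

```python
# Least-squares fit of y = a*x + b on hypothetical toy data (pure Python).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]            # exactly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form slope and intercept of the ordinary least-squares line.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    # Predicting simply applies the learned linear relation.
    return a * x + b

print(a, b, predict(10.0))           # 2.0 1.0 21.0
```

Note how little there is to store (two numbers) and how cheap prediction is, which is exactly why the algorithm is fast and memory-light.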
The Decision Tree algorithm is a classification and regression algorithm. It subdivides the learning data into regions having similar features (Figure 11). Descending the tree, as in Figure 12, allows the prediction of the class or value of a new input data point.
Main advantages:
• quite simple
• easy to communicate about
• easy to maintain
• few parameters are required and they are quite intuitive
• prediction is quite fast
Main drawbacks:
• can take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be)
• naturally overfits (it generates high-variance models; pruning the branches mitigates this, though)
• not capable of being incrementally improved
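Descending a trained tree takes only a few lines. The tree below is hand-built and entirely hypothetical (feature indices, thresholds and class labels are made up for illustration):

```python
# Each internal node: (feature_index, threshold, left_subtree, right_subtree);
# each leaf is simply a class label.
tree = (0, 5.0,                      # if x[0] < 5.0 go left, else right
        (1, 2.0, "A", "B"),          # left child tests x[1] < 2.0
        "C")                         # right child is already a leaf

def predict(node, x):
    # Walk down until we hit a leaf (here, a plain string label).
    while not isinstance(node, str):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node

print(predict(tree, [3.0, 1.0]))     # "A"
print(predict(tree, [7.0, 0.0]))     # "C"
```

Prediction is a handful of comparisons, which is why it is fast; the memory cost comes from storing the (possibly very large) tree itself.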
Random Forest is a classification and regression algorithm. Here, we train several decision trees. The original learning dataset is randomly divided into several subsets of equal size, and a decision tree is trained on each subset. Note that a random subset of features is also selected for the learning of each decision tree. During prediction, every decision tree is descended, and the individual predictions are averaged (for regression) or put to a majority vote (for classification) (Figure 13).
Main advantages:
• is robust to overfitting (thus solving one of the biggest disadvantages of decision trees)
• parameterization remains quite simple and intuitive
• performs very well when the number of features is big and for large quantities of learning data
Main drawbacks:
• models generated with Random Forest may take a lot of memory
• learning may be slow (depending on the parameterization)
• not possible to iteratively improve the generated models
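The voting step can be sketched as follows; the "trees" here are hypothetical already-trained stumps, standing in for real decision trees each trained on its own random subset of data and features:

```python
from collections import Counter

# Hypothetical already-trained stumps (stand-ins for full decision trees).
stumps = [
    lambda x: "red" if x[0] < 2.0 else "blue",
    lambda x: "red" if x[1] < 1.0 else "blue",
    lambda x: "blue",                        # a weak, biased tree
]

def forest_predict(x):
    # Classification: descend every tree, then take a majority vote.
    votes = Counter(stump(x) for stump in stumps)
    return votes.most_common(1)[0][0]

print(forest_predict([1.0, 0.5]))   # two "red" votes vs one "blue" -> "red"
```

For regression, the `Counter` vote would simply be replaced by an average of the trees' numeric predictions.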
Boosting is similar to Random Forest in that it trains several smaller models to build a bigger one. In this case, however, the models are trained one after the other (i.e. model n+1 depends on model n). Here, the smaller models are called « weak predictors ». The Boosting principle is to « increase » the « importance » of the data points that the previous weak predictor learned poorly (Figure 14). Similarly, the « importance » of the learning data that was already well learned is decreased. By doing these two things, the next weak predictor learns better. Thus, the final predictor (model), a serial combination of the weak predictors, is capable of predicting complex new data. Prediction then simply consists in checking whether new data falls in the blue or the red space, for instance, in the classification problem of Figure 14.
Main advantages are:
• parameterization is quite simple: even a very simple weak predictor may allow the training of a strong model in the end (for instance, a decision stump as a weak predictor may lead to great performance!)
• is quite robust to overfitting (and, as it’s a serial approach, it can be optimized for prediction)
• performs well for large amounts of data
Main drawbacks are:
• training may be time consuming (especially if, on top of it, we train an optimization approach for the prediction, such as a Cascade or a Soft-Cascade)
• may take a lot of memory, depending on the weak predictor
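The reweighting step can be sketched as an AdaBoost-style update, a common Boosting variant; the labels, predictions and formulas below follow AdaBoost and are not necessarily the exact scheme the article has in mind:

```python
import math

# One AdaBoost-style reweighting step: increase the weight of misclassified
# points, decrease it for correctly classified ones (hypothetical toy data).
labels      = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]       # the weak predictor gets point 1 wrong
weights     = [0.25] * 4             # start uniform

# Weighted error of this weak predictor, and its "importance" alpha.
err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
alpha = 0.5 * math.log((1 - err) / err)

# Misclassified points (y*p = -1) get exp(+alpha), correct ones exp(-alpha).
weights = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
total = sum(weights)
weights = [w / total for w in weights]    # renormalize to sum to 1

print(weights)   # the misclassified point now carries half of the weight
```

The next weak predictor is then trained against these new weights, so it focuses on exactly the points its predecessor got wrong.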
Support Vector Machine (SVM)
The Support Vector Machine finds the separation (here, a hyperplane in an n-dimensional space) that maximizes the margin between two data populations (Figure 16). By maximizing this margin, we mathematically reduce the tendency to overfit the learning data. The separation maximizing the margin between the two populations is based on support vectors. These support vectors are the data points closest to the separation, and they define the margin (Figure 16). Once the hyperplane is trained, you only need to store the support vectors for the prediction. This saves a lot of memory when storing the model.
During prediction, you only need to know if your new input data point is “below” or “above” your hyperplane (Figure 17).
Main advantages:
• is mathematically designed to reduce overfitting by maximizing the margin between data points
• prediction is fast
• can manage a lot of data and a lot of features (high dimensional problems)
• doesn’t take too much memory to store
Main drawbacks:
• can be time consuming to train
• parameterization can be tricky in some cases
• the resulting model isn’t easy to communicate about
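The prediction step is especially cheap: once training has produced a hyperplane w·x + b = 0, classifying a point only requires checking which side of it the point falls on. The weights below are hypothetical values that training would have produced, not a real trained model:

```python
# Hypothetical trained hyperplane: w . x + b = 0 in a 2-D feature space.
w = [1.0, -2.0]
b = 0.5

def svm_predict(x):
    # Sign of the signed distance tells us the side of the hyperplane.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return +1 if score >= 0 else -1

print(svm_predict([3.0, 1.0]))    # 3 - 2 + 0.5 = +1.5  -> +1 ("above")
print(svm_predict([0.0, 2.0]))    # 0 - 4 + 0.5 = -3.5  -> -1 ("below")
```

A dot product and a comparison per prediction is why SVM inference is fast even in high-dimensional problems.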
Neural Networks learn the weights of connections between neurons (Figure 18). The weights are adjusted, learning data point after learning data point as shown in Figure 18. Once all weights are trained, the neural network can be used to predict the class (or a quantity, in case of regression) of a new input data point (Figure 19).
Main advantages:
• very complex models can be trained
• can be used as a kind of black box, without performing complex feature engineering before training the model
• numerous kinds of network structures can be used, offering very interesting properties (CNN, RNN, LSTM, etc.). Combined with the “deep” approach, even more complex models can be learned, unleashing new possibilities: object recognition, for instance, has recently been greatly improved using Deep Neural Networks.
Main drawbacks:
• very hard to explain simply (people usually say that a Neural Network behaves and learns like a little human brain)
• parameterization is very complex (what kind of network structure should you choose? What are the best activation functions for your problem?)
• requires a lot more learning data than usual
• the final model may take a lot of memory.
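The weight-adjustment idea can be illustrated on a single linear neuron with one plain gradient step, a deliberately minimal stand-in for real backpropagation; all numbers (weights, learning rate, data point) are hypothetical:

```python
# One gradient step for a single linear neuron: nudge each connection
# weight to reduce the prediction error on one learning data point.
weights = [0.1, 0.2]
bias = 0.0
lr = 0.1                              # learning rate (hypothetical)

x, target = [1.0, 2.0], 1.0
output = sum(w * xi for w, xi in zip(weights, x)) + bias   # before: 0.5
error = output - target                                    # -0.5

# Gradient of the squared error w.r.t. each weight is error * input.
weights = [w - lr * error * xi for w, xi in zip(weights, x)]
bias -= lr * error

new_output = sum(w * xi for w, xi in zip(weights, x)) + bias
print(output, "->", new_output)       # the prediction moves toward 1.0
```

A real network repeats this for every weight of every layer, data point after data point, which is where both the power and the training cost come from.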
The K-Means algorithm
This is the only non-supervised algorithm in this article. The K-Means algorithm discovers groups (or clusters) in non-labelled data. The principle of this algorithm is to first select K random cluster centers among the unlabelled data. Each unlabelled data point is then assigned to the group of its nearest cluster center. Once every point has been assigned, each cluster’s center is re-estimated from the points within the cluster. These two steps are repeated until convergence. After enough iterations, we have labels for our previously unlabelled data! (Figure 20).
Main advantages:
• parameterization is intuitive and the algorithm works well with a lot of data
Main drawbacks:
• the number K of clusters must be known in advance… finding the best K may require a lot of trials
• clustering results may differ from one run to another due to the random initialization of the algorithm
Advantage or drawback:
• the K-Means algorithm is actually more a partitioning algorithm than a clustering algorithm. It means that, if there is noise in your unlabelled data, it will be incorporated within your final clusters. If you want to avoid modelling the noise, you might want to turn to a more elaborate approach such as the HDBSCAN or OPTICS clustering algorithms.
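The assign-then-re-estimate loop fits in a dozen lines; here is a 1-D sketch with K = 2, hypothetical data and fixed (rather than random) initial centers, with no handling of the empty-cluster corner case:

```python
# Minimal 1-D K-Means sketch (K = 2, hypothetical toy data).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = [0.0, 5.0]                 # initial centers (fixed for the sketch)

for _ in range(10):                  # a few iterations reach convergence here
    # Step 1: assign each point to its nearest center.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda k: abs(p - centers[k]))
        clusters[nearest].append(p)
    # Step 2: move each center to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(centers)                       # roughly [1.0, 9.0]
```

With a different (unlucky) initialization the loop can settle on different clusters, which is exactly the run-to-run variability noted above.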
One-Class Support Vector Machine (OC-SVM)
This is the only anomaly-detection algorithm in this article. The principle of the OC-SVM algorithm is very close to that of the SVM algorithm, except that the hyperplane you train here is the one maximizing the margin between the data and the origin, as in Figure 21. In this scenario, there is only one class: the “normal” class, i.e. all the data points belong to one class. If a new input data point is below the hyperplane, it simply means that this specific data point can be considered an anomaly.
Advantages and drawbacks: similar to those of the SVM algorithm presented above.
5) Choosing which algorithm to use
Now that we’ve been through some of the most popular ML algorithms, this table might help you decide which to use!
* Only non-supervised algorithm presented
** May not require feature engineering
6) Practical advice
Let’s wrap this up with some practical advice! My first piece of advice is not to forget to look at your data before doing anything! It may save you a lot of time afterwards. Looking directly at your raw data gives you good insights.
I also deeply recommend working iteratively! Amongst the ML algorithms you identified as potentially good approaches, always begin with those whose parameterization is intuitive and simple. That’ll allow you to quickly determine whether the approach you picked fits. This is especially true when you’re working on a Proof Of Concept (POC).
Although it is very important, I won’t go into much detail about features and feature engineering. Depending on the problem, features may be obvious and easy to find in the data. In many cases, that’s enough to get well-performing models. But sometimes you need additional features for better training. Be careful about having too many features: you might face the curse of dimensionality and need a lot more data to compensate. Besides, having too many features increases the risk of multicollinearity, and that’s not great. Fortunately, the number of dimensions (or features) can be reduced using dimension reduction algorithms (the best known being PCA). Adding or removing features is the main purpose of feature engineering.
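As a quick illustration of dimension reduction, here is a pure-Python PCA sketch that projects hypothetical 2-D points onto their first principal component (power iteration on the 2×2 covariance matrix is one simple way to find it, not the only one):

```python
# Project 2-D toy data onto its first principal component (PCA sketch).
data = [(2.0, 2.1), (1.0, 0.9), (3.0, 3.2), (0.0, -0.1)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration converges to the dominant eigenvector (1st component).
vx, vy = 1.0, 0.0
for _ in range(50):
    vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = (vx * vx + vy * vy) ** 0.5
    vx, vy = vx / norm, vy / norm

# One feature per point instead of two, keeping most of the variance.
reduced = [x * vx + y * vy for x, y in centered]
print(len(reduced), "values instead of", 2 * len(data))
```

Since this toy data lies roughly along y = x, the component comes out close to the diagonal, and a single coordinate per point preserves almost all the information.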
I also recommend first experimenting with your approach in sandbox mode on a restricted dataset. High-level languages such as R, Matlab or Python are perfect for that. Only once you have validated your approach in sandbox mode should you implement it in your product.
In case your problem is non-linear, algorithms such as Naive Bayes, linear regression and logistic regression are not suitable. Other algorithms may require a different parameterization.
Concerning the performance, it is often difficult to know in advance which algorithm is going to perform the best amongst those identified as good approaches. The best way to know is often to try them all and see!
So let’s get going!
You see, machine learning isn’t out of reach! By correctly defining your problem and understanding how these algorithms work, you can quickly identify good approaches. And with more and more practice, you won’t even have to think about it!