**4) Popular ML algorithms**

Now, we’re going to review the most popular ML algorithms. For each algorithm, we’ll talk about its advantages and drawbacks.

**Linear Regression**

Linear Regression is a regression algorithm (but you probably figured that out 🙂). This algorithm’s principle is to find a linear relation within your data (Figure 10). Once the linear relation is found, a new value is predicted from this relation.

**Main advantages:**

• very simple algorithm

• doesn’t take a lot of memory

• quite fast

• easy to explain

**Main drawbacks:**

• requires the data to follow a linear trend (see “Polynomial Regression” if you think you need a polynomial fit)

• is unstable when features are redundant, i.e. if there is multicollinearity (in that case, have a look at “Elastic-Net” or “Ridge Regression”).
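To make the principle concrete, here is a minimal sketch of a linear regression with a single feature, using the closed-form least-squares solution. The function names (`fit_line`, `predict`) and the toy data are made up for illustration; in a real project you would most likely rely on a library instead.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict a new value by following the learned linear relation."""
    slope, intercept = model
    return slope * x + intercept


# Toy data (invented for this example): y is roughly 2 * x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

model = fit_line(xs, ys)
print(round(predict(model, 5.0), 2))  # 10.91
```

Note how small the trained model is: two numbers. That is where the low memory footprint and the fast prediction come from.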

**Decision Tree**

The Decision Tree algorithm is a classification and regression algorithm. It subdivides the learning data into regions having similar features (Figure 11). Descending the tree as in Figure 12 predicts the class or value of a new input data point.

**Main advantages:**

• quite simple

• easy to communicate about

• easy to maintain

• few parameters are required and they are quite intuitive

• prediction is quite fast

**Main drawbacks:**

• can take a lot of memory (the more features you have, the deeper and larger your decision tree is likely to be)

• naturally overfits a lot (it generates high-variance models; it suffers less from that if the branches are pruned, though)

• not capable of being incrementally improved
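Here is a small sketch of a classification tree, assuming Gini impurity as the splitting criterion (one common choice among several). The names `best_split`, `build_tree` and `predict`, and the toy data, are invented for the example. The `max_depth` parameter is a crude stand-in for pruning: a shallower tree overfits less.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 when all labels are identical."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Try every (feature, threshold) pair; return the lowest weighted Gini."""
    best = None  # (score, feature, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively subdivide the data; leaves hold the majority class."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))


def predict(tree, row):
    """Descend the tree until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Toy data (invented): two well-separated groups in 2-D
rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (6.0, 5.0), (6.5, 6.0), (7.0, 5.5)]
labels = ["a", "a", "a", "b", "b", "b"]

tree = build_tree(rows, labels)
print(predict(tree, (1.2, 1.5)), predict(tree, (6.8, 5.0)))  # a b
```

The trained tree is just nested `(feature, threshold, left, right)` tuples, which is why it is easy to read and to communicate about.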

**Random Forest**

Random Forest is a classification and regression algorithm. Here, we train several decision trees. Each tree is trained on its own random subset of the learning dataset, all subsets having the same size (in the standard algorithm, these are bootstrap samples, i.e. drawn with replacement). A random subset of features is also selected for the learning of each decision tree. During prediction, all decision trees are descended; their predictions are averaged for regression, or put to a majority vote for classification (Figure 13).

**Main advantages:**

• is robust to overfitting (thus solving one of the biggest disadvantages of decision trees)

• parameterization remains quite simple and intuitive

• performs very well when the number of features is large and for large quantities of learning data

**Main disadvantages:**

• models generated with Random Forest may take a lot of memory

• learning may be slow (depending on the parameterization)

• not possible to iteratively improve the generated models.
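The sketch below illustrates the idea for classification, with two simplifying assumptions: each "tree" is only one level deep (a decision stump), to keep the code short, and each training subset is a bootstrap sample (drawn with replacement), which is the usual Random Forest choice. All function names and the toy data are invented for the example.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def train_stump(rows, labels, features):
    """One-level tree: best (feature, threshold) among the allowed features."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (gini(labels), None, None, majority, majority)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best[1:]  # (feature, threshold, left_class, right_class)


def predict_stump(stump, row):
    f, t, left_class, right_class = stump
    if f is None:  # no useful split was found: always answer the majority class
        return left_class
    return left_class if row[f] <= t else right_class


def train_forest(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random feature subset
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Classification: majority vote over all trees."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Toy data (invented): two well-separated groups in 2-D
rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2), (1.2, 1.8), (1.8, 1.4),
        (6.0, 5.0), (6.5, 6.0), (7.0, 5.5), (6.2, 5.8), (6.8, 5.2)]
labels = ["a"] * 5 + ["b"] * 5

forest = train_forest(rows, labels)
print(predict_forest(forest, (1.4, 1.6)), predict_forest(forest, (6.4, 5.4)))
```

For regression, the vote in `predict_forest` would be replaced by an average of the individual predictions. Storing every tree is also what makes the final model memory-hungry.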

**Boosting**

Boosting is similar to Random Forest because it trains several smaller models to make a bigger one. In this case, the models are trained one after the other (i.e. model n+1 depends on model n). Here, the smaller models are named “weak predictors”. The Boosting principle is to increase the “importance” of the data points that the previous weak predictor handled poorly (Figure 14). Similarly, the “importance” of the learning data that was handled well is decreased. By doing these two things, the next weak predictor focuses on the hard cases. Thus, the final predictor (model), a serial combination of the weak predictors, is capable of predicting complex new data. In the classification problem of Figure 14, for instance, predicting simply means checking whether a new data point falls in the blue or the red region.

**Main advantages are:**

• parameterization is quite simple; even a very simple weak predictor may allow the training of a strong model at the end (for instance, having a decision stump as a weak predictor may lead to great performance!)

• is quite robust to overfitting (as it’s a serial approach, it can be optimized for prediction)

• performs well for large amounts of data

**Main drawbacks:**

• training may be time consuming (especially if we train, on top of it, an optimization approach for the prediction, such as a Cascade or a Soft-Cascade approach)

• may take a lot of memory, depending on the weak predictor
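As an illustration, here is a sketch of AdaBoost, one well-known boosting algorithm (the text above does not say which variant the figures show, so take this as one possible instance). The weak predictor is a decision stump on 1-D data, labels are +1 and -1, and the function names and toy data are invented for the example.

```python
import math


def stump_predict(threshold, polarity, x):
    """Answer `polarity` above the threshold and `-polarity` below it."""
    return polarity if x > threshold else -polarity


def train_stump(xs, ys, weights):
    """Weak predictor: the stump with the lowest weighted error."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None  # (error, threshold, polarity)
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(t, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n  # every point starts with the same importance
    ensemble = []            # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        err, t, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # avoid a division by zero on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Increase the weight of misclassified points, decrease the others
        weights = [w * math.exp(-alpha * y * stump_predict(t, polarity, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all weak predictors."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (invented): no single stump can separate these labels
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs] == ys)  # True
```

On this toy data, the best single stump still misclassifies three points, while the combination of three stumps classifies all ten correctly.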

**Support Vector Machine (SVM)**

The Support Vector Machine finds the separation (here, a hyperplane in an n-dimensional space) that maximizes the margin between two data populations (Figure 16). By maximizing this margin, we mathematically reduce the tendency to overfit the learning data. The separation maximizing the margin between the two populations is based on support vectors. These support vectors are the data points closest to the separation, and they define the margin (Figure 16). Once the hyperplane is trained, you only need to store the support vectors for the prediction. This saves a lot of memory when storing the model.

During prediction, you only need to know if your new input data point is “below” or “above” your hyperplane (Figure 17).

**Main advantages:**

• is mathematically designed to reduce overfitting by maximizing the margin between data points

• prediction is fast

• can manage a lot of data and a lot of features (high-dimensional problems)

• doesn’t take too much memory to store

**Main drawbacks:**

• can be time consuming to train

• parameterization can be tricky in some cases

• isn’t easy to communicate about
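The prediction step can be sketched in a few lines. The example below assumes a linear SVM that has already been trained: the support vectors, their coefficients and the bias are hand-derived for a toy 2-D problem whose closest points are (3, 3) for class +1 and (1, 1) for class -1. Training itself (solving the underlying optimization problem) is deliberately left out, and all names are made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical trained model: only the support vectors are stored.
# Each entry is (point, label, coefficient); values hand-derived for the toy problem.
support_vectors = [((3.0, 3.0), +1, 0.25),
                   ((1.0, 1.0), -1, 0.25)]
bias = -2.0


def decision_function(x):
    """Signed score: positive above the hyperplane, negative below it."""
    return sum(coef * label * dot(sv, x)
               for sv, label, coef in support_vectors) + bias


def predict(x):
    return 1 if decision_function(x) >= 0 else -1


print(decision_function((3.0, 3.0)))  # 1.0  (a support vector, on the margin)
print(decision_function((1.0, 1.0)))  # -1.0 (the other support vector)
print(predict((4.0, 4.0)), predict((0.0, 0.0)))  # 1 -1
```

Training points that lie farther from the separation than the support vectors, such as (4, 4) or (0, 0) here, play no role in the stored model, which is why it stays compact.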

**Neural networks**

Neural Networks learn the weights of connections between neurons (Figure 18). The weights are adjusted, learning data point after learning data point, as shown in Figure 18. Once all weights are trained, the neural network can be used to predict the class (or a quantity, in case of regression) of a new input data point (Figure 19).

**Main advantages:**

• very complex models can be trained

• can be used as a kind of black box, without performing complex feature engineering before training the model

• numerous kinds of network structures can be used, allowing you to enjoy very interesting properties (CNN, RNN, LSTM, etc.). Combined with the “deep approach”, even more complex models can be learned, unleashing new possibilities: object recognition has recently been greatly improved using Deep Neural Networks.

**Main drawbacks:**

• very hard to explain simply (people usually say that a Neural Network behaves and learns like a little human brain)

• parameterization is very complex (what kind of network structure should you choose? What are the best activation functions for your problem?)

• requires a lot more learning data than usual

• the final model may take a lot of memory.
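To show what "adjusting the weights data point after data point" means, here is a sketch of the smallest possible network: a single neuron with a step activation, trained with the classic perceptron rule. Real networks stack many neurons and use gradient-based training, so this is only an illustration of the update loop; the names and the toy task (learning a logical AND) are chosen for the example.

```python
def step(z):
    """Step activation: the neuron fires (1) or not (0)."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_neuron(samples, epochs=20, learning_rate=1.0):
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:  # one learning data point at a time
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy task: learn the logical AND of two inputs
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_neuron(and_gate)
print([predict(weights, bias, inputs) for inputs, _ in and_gate])  # [0, 0, 0, 1]
```

Even in this tiny case, the trained model is only a list of numbers, which gives a feel for why larger networks are hard to explain.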

**The K-Means algorithm**

This is the only non-supervised algorithm in this article. The K-Means algorithm discovers groups (or clusters) in unlabelled data. Its principle is to first select K random cluster centers in the unlabelled data. Each unlabelled data point is then assigned to the group of its nearest cluster center. Once every data point has been assigned, a new center is estimated within each cluster. These two steps are repeated until convergence. After enough iterations, we have labels for our previously unlabelled data (Figure 20)!

**Main advantages:**

• parametrization is intuitive and works well with a lot of data.

**Main drawbacks:**

• you need to know in advance how many clusters there are in your data. This may require a lot of trials to “guess” the best number of clusters K.

• the clustering may differ from one run to another due to the random initialization of the algorithm

**Advantage or drawback:**

• the K-Means algorithm is actually more a partitioning algorithm than a clustering algorithm. It means that, if there is noise in your unlabelled data, it will be incorporated into your final clusters. If you want to avoid modelling the noise, you might want to use a more elaborate approach such as the HDBSCAN clustering algorithm or the OPTICS algorithm.
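The two alternating steps can be sketched as follows. The new center of a cluster is taken to be the mean of its points (the usual K-Means choice). The function name `kmeans`, its optional `init` parameter (handy for forcing a known starting point) and the toy data are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Random initialization: pick k data points as the first centers
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # nothing changed: converged
        labels = new_labels
        # Update step: each center moves to the mean of its cluster
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centers, labels


# Toy data (invented): two groups of three points
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]

centers, labels = kmeans(points, k=2)
print(centers, labels)
```

Changing `seed` changes the initial centers, and with them possibly the final clusters. On this toy data, starting with both centers inside the same group can even leave the algorithm stuck in a poor partition, which is the run-to-run variability listed among the drawbacks.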

**One-Class Support Vector Machine (OC-SVM)**

This is the only anomaly detection algorithm in this article. The principle of the OC-SVM algorithm is very close to that of the SVM algorithm, except that the hyperplane you train here is the one maximizing the margin between the data and the origin, as in Figure 21. In this scenario, there is only one class, the “normal” class, i.e. all the training data points belong to one class. If a new input data point falls below the hyperplane, it can be considered an anomaly.

**Advantages and drawbacks:** similar to those of the SVM algorithm presented above.

**5) Choosing which algorithm to use**

Now that we’ve been through some of the most popular ML algorithms, this table might help you decide which to use!

* Only non-supervised algorithm presented

** May not require feature engineering

**6) Practical advice**

Let’s wrap this up with some practical advice! My first piece of advice is not to forget to have a look at your data before doing anything! It may save you a lot of time afterwards. Looking directly at your raw data gives you good insights.

I also strongly recommend that you work iteratively! Amongst the ML algorithms you identified as potentially good approaches, you should always begin with those whose parametrization is intuitive and simple. That’ll allow you to quickly determine whether the approach you picked is a good fit. This is especially true when you’re working on a Proof Of Concept (POC).

Although it is very important, I haven’t gone into much detail about features and feature engineering. Depending on the problem, features may be obvious and easy to find in the data. In many cases, that’s enough to get well-performing models. But sometimes you need additional features for better training. Be careful about having too many features, though: you might face the curse of dimensionality and need a lot more data to compensate. Besides, having too many features increases the risk of multicollinearity, and that’s not great. Fortunately, the number of dimensions (or features) can be reduced using dimensionality reduction algorithms (the best known being PCA). Adding or removing features is the main purpose of feature engineering.

I also recommend that you first experiment with your approach in sandbox mode on a restricted dataset. High-level languages such as R, Matlab or Python are perfect for that. Once, and only once, you have validated your approach in sandbox mode, you can implement it in your product.

If your problem is non-linear, algorithms such as Naive Bayes, linear regression and logistic regression are not suitable. Other algorithms may require a different parametrization.

Concerning the performance, it is often difficult to know in advance which algorithm is going to perform the best amongst those identified as good approaches. The best way to know is often to try them all and see!

**So let’s get going!**

You see, machine learning isn’t out of reach! By correctly defining your problem and understanding how these algorithms work, you can quickly identify good approaches. And with more and more practice, you won’t even have to think about it!

Happy coding,


## Lanvin Liu

Last paragraph of the first part, it is Figure 3 twice and not Figure 4.

## Justine Baron

Hi, thanks for that! It’s corrected 🙂

## Anonymous

Thanks, such a nice summary of all common ML algorithms.

## preeti

thanks for those efforts ,for summarizing ,please add naive baise explanation also

## Shivakumar Panuganti

Thanks for sharing your valuable insights 🙂

## Anonymous

This blog is awesome! It defines every concept in short and to the point with amazing layman term examples. 🙂

## Robert J Alexander

Are you really sure that you are looking at a cat vs GOD graph? 🙂 🙂

## Anonymous

Awesome blog..!!