– If you want to directly jump to the algorithm breakdown, go to page 2 –
We frequently hear about Machine Learning in the media, especially since the recent wave of interest in deep learning. The steady improvement of Machine Learning techniques, combined with the ever-increasing amount of stored data, suggests endless new applications. Many innovative solutions are emerging: autonomous driving, next-generation supermarkets with implicit payment, next-generation chatbots that can interact with you as a human being would, and so on. More than ever, the future seems within reach. But the more extravagant and original the application is, the more the layman is put off. The plethora of Machine Learning algorithms and approaches only reinforces that feeling. In this article, we’ll dispel the myth that ML is out of reach, and we’ll give you tips on choosing an ML algorithm to solve your problem with minimal effort.
1. Why and when to use Machine Learning?
But first of all, what problems can ML solve? ML is trendy, for sure, but we should not forget that ML’s main goal is to help us solve problems that are difficult to solve with traditional programming. With ML algorithms, we can learn complex decision systems from data (for instance, to forecast quantities or to estimate which category a data point most likely belongs to, Fig 1.1), discover latent structure in unexplored data to find patterns that nobody would expect (in Fig 1.2: Picasso’s paintings organized by period and artistic style; ok, people would have expected this I guess 🙂 ), and find anomalies in data (for instance, to automatically raise an alert if something suspicious happens in the data, Fig 1.3). ML is very useful for automatically processing complex and/or large amounts of data.
Let’s first compare the traditional programming approach with the ML approach. As you can see in Figure 2, in the traditional programming approach, we start with data and a program we wrote that takes this data as input. We then obtain the results as output. In the ML approach, we do it a little differently: we start with data AND the results we already know for this dataset, and we obtain a trained program as output. That program is then used, like any hand-written program in the traditional approach, to produce results on new data.
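To make the contrast tangible, here is a minimal sketch in Python. Everything in it is invented for illustration: the temperature data, the 25-degree rule and the function names are assumptions, and the "training" is a deliberately naive threshold search rather than a real ML algorithm.

```python
# Traditional programming: data + hand-written program -> results.
def handwritten_rule(temperature_c):
    """A rule chosen by the developer (the 25.0 cut-off is an arbitrary assumption)."""
    return 1 if temperature_c >= 25.0 else 0


# ML approach: data + known results -> program.
def train_threshold(values, labels):
    """Return the cut-off that misclassifies the fewest training examples."""
    candidates = sorted(values)
    best_cut, best_errors = candidates[0], len(values) + 1
    for cut in candidates:
        errors = sum((1 if v >= cut else 0) != y for v, y in zip(values, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut


def make_program(cut):
    """Wrap the learned cut-off into a program usable like any other function."""
    return lambda value: 1 if value >= cut else 0


# Toy dataset: temperatures and the results we already know for them.
temperatures = [12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0]
known_results = [0, 0, 0, 1, 1, 1, 1]

# The output of training is a program...
learned_program = make_program(train_threshold(temperatures, known_results))

# ...which is then used on new data, exactly like the hand-written rule.
predictions = [learned_program(t) for t in [10.0, 22.5, 29.0]]
print(predictions)  # [0, 1, 1]
```

Note that the learned cut-off (21.0 here) comes from the data, whereas the hand-written one comes from the developer’s head.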
Now you might think:
“Ok great, I now understand better when to use ML and how it differs from the traditional programming approach.”
“Yet, I’d like a more concrete example.”
Ok, let’s give a concrete example then :). What about the face detection problem in Figure 3? In the case of face detection, the dataset is composed of images of faces and images of background. The known results for these images are either 1, which means “detected”, for face images, or 0, which means “nothing detected”, for background images. The final program is the face predictor. In the end, the “face predictor” program is used as in the traditional programming paradigm, and its input data are image chunks taken from a bigger image in which we wish to detect faces. The results are either “this image chunk does not contain a face” or “this image chunk contains a face”. In Figure 3, the image chunks that are said to contain faces are surrounded by a green border.
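The “cut a big image into chunks and ask the predictor about each one” step can be sketched as follows. This is only an illustration under strong assumptions: the image is a tiny grid of 0/1 pixels, and `toy_face_predictor` is a hypothetical stand-in for a trained face predictor (it just checks whether a chunk is mostly bright).

```python
def iter_chunks(image, size, step):
    """Yield (row, col, chunk) for every size x size window of a 2D image."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - size + 1, step):
        for c in range(0, cols - size + 1, step):
            chunk = [line[c:c + size] for line in image[r:r + size]]
            yield r, c, chunk


def toy_face_predictor(chunk):
    """Hypothetical stand-in for a trained model: 1 if the chunk is mostly bright."""
    pixels = [p for line in chunk for p in line]
    return 1 if sum(pixels) / len(pixels) > 0.5 else 0


def detect(image, predictor, size=2, step=1):
    """Return the top-left corner of every chunk the predictor flags with 1."""
    return [(r, c) for r, c, chunk in iter_chunks(image, size, step)
            if predictor(chunk) == 1]


# A 4x4 "image" whose bright 2x2 centre plays the role of a face.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

detections = detect(image, toy_face_predictor)
print(detections)  # [(1, 1)]
```

Each returned corner corresponds to one of the green borders you would draw on the bigger image.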
2. Defining your problem
First, it’s very important to define the problem clearly in order to solve it better afterwards. This can easily be done by answering these three questions: 1) what do you want to do? 2) what is available? and 3) what are your constraints?
What do you want to do?
Do you want to predict a category? That’s classification. For instance, you want to know if an input image belongs to the cat category or the dog category.
Do you want to predict a quantity? That’s regression. For instance, you want to predict the market value of a house, knowing its floor area, its location and whether or not it has a garage. In this case, go for a regression approach, because you want to predict a price, i.e. a quantity, not a category.
Do you want to detect an anomaly? That’s anomaly detection… 🙂 For instance, you want to detect money withdrawal anomalies. Imagine that you live in England and have never been abroad, and that money has been withdrawn five times in Las Vegas from your bank account. In this case, you would want the bank to detect that and prevent it from happening again.
Do you want to discover structure in unexplored data? That’s clustering. For instance, imagine you have a large amount of website logs. You might want to explore them to see if there are groups of similar visitor behavior. These groups might help you improve your website.
What is available?
How much data do you have? Of course, this depends on the problem you want to solve and the kind of data you’re playing with, but knowing the amount of data you have is important. If you have more than 100,000 data points, practically every kind of algorithm becomes an option!
Do your data points have labels? That is, do you know the category of each data point you have? If you know the category an image belongs to, you know its label (Figure 4). If you don’t, the data points are unlabeled (Figure 5).
Do you have a lot of features to work with? The number of features you have might influence your choice of algorithm. In the case of house price forecasting, you might need to know the total floor area of the house, the number of floors, the proximity to the city center, and so on. Relevant features generally make your analysis more accurate, but having too many or too few features will restrict your choice of algorithm. Having too many features also increases the risk of redundant features: features that are correlated, such as the area of the house and its inner volume, affect the performance of some algorithms.
How many classes do you have? Knowing how many classes (categories) you have is important for some ML algorithms, especially for some exploratory ML algorithms.
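Coming back to redundant features: one simple way to spot them is to compute the correlation between every pair of features. The sketch below does this with the Pearson coefficient on made-up house data. The feature names, the values and the 0.95 threshold are assumptions chosen for the example, not recommendations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(features, threshold=0.95):
    """List the feature pairs whose absolute correlation reaches the threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                pairs.append((a, b))
    return pairs


# Invented data: the volume is simply the area times a 2.5 m ceiling height,
# so the two features carry the same information.
houses = {
    "area_m2": [50.0, 80.0, 120.0, 200.0, 95.0],
    "volume_m3": [125.0, 200.0, 300.0, 500.0, 237.5],
    "distance_km": [3.0, 12.0, 1.0, 8.0, 20.0],
}

print(redundant_pairs(houses))  # [('area_m2', 'volume_m3')]
```

When a pair shows up in this list, dropping one of the two features is usually a reasonable first move before trying algorithms that are sensitive to correlated inputs.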
What are your constraints?
What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to cluster. This is the case, for instance, for embedded systems.
Does the prediction have to be fast? In real-time applications, it is very important to get a prediction as fast as possible. For instance, in autonomous driving, road signs must be classified quickly enough to avoid accidents.
Does the learning have to be fast? In some circumstances, training models quickly is necessary: sometimes, you need to rapidly update your model on the fly with a different dataset.
Two other very important aspects that we, enthusiastic developers, tend to forget are the maintainability of the solution we choose and communication.
Maintainability: it is sometimes wiser to go for a simpler solution that gives correct results than for a very sophisticated solution that gives slightly better results but that you’re not 100% confident with. Otherwise, you might not be able to easily update the solution or fix a bug in the future.
Communication: we, developers, sometimes work with non-developers 🙂 For some projects, it is necessary to present your solution to people from other professions. In this case, it might be wise to go for an ML solution that is easier for a layman to understand.
3. A little bit of theory
A bit of theory is required before going any further. Let’s first talk about the different existing ML approaches: the supervised, the non-supervised and the semi-supervised approach. Then, we’ll talk about very important notions in ML: the bias and the variance.
In supervised learning, all our data points are labeled. The goal is to find a good separation between classes, as you can see in Figure 6. Here, we want to correctly separate blue labeled data points and red labeled data points.
In non-supervised learning, the input data points are NOT labeled. The goal here is to group data points by similarity or proximity. Then, labels may be attributed to the groups of data points.
In semi-supervised learning, both approaches are mixed. A model is first trained using a few labeled data points. Unlabeled data points are then used to further improve the model. This approach is very interesting because we often encounter situations where we have few labeled data points and a large amount of unlabeled data points available. Famous semi-supervised approaches are Active Learning and Co-Training. In Active Learning, users periodically label data manually, and these newly labeled data points are incorporated into the next training rounds. Co-Training is interesting because it does not require any human interaction: two or more predictors are trained on different “views” of the same unlabeled data points (i.e. using different sets of features). The classifiers are then tested against new incoming unlabeled data points. Misclassified data points are reincorporated into the next training rounds to correct these errors. This approach does require defining a measure of confidence, though, to decide whether one or both of the classifiers got the classification wrong.
Bias and variance
Bias and variance are two important notions in ML. They are indicators you should always keep an eye on when training your models. They will allow you to have an idea of what the performance of your model is going to be with new input data.
The bias is the error due to erroneous learning assumptions. A high bias means that your model is too simple to capture the structure of your data: it separates your data poorly (Figure 8) and misses the relevant relations between the features and the labels. This is also known as underfitting.
The variance is the error due to sensitivity to small fluctuations in the learning dataset. A high variance means you fitted your learning data too closely. In that case, the model won’t adapt well to new input data points. ML’s main goal is to produce a model that generalizes to new input data, so fitting the learning data too closely runs contrary to this objective. As a concrete example, imagine you trained a model that fits your learning data too closely, as in Figure 9. If you ask this model whether a new input data point belongs to the blue or the red class, it will answer that the point belongs to the blue class (Figure 9), whereas you would naturally have expected it to be marked as red, because three red data points surround it.
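The situation described for Figure 9 can be imitated with a few lines of Python. Be aware that this is our own toy reconstruction, not the model behind the figure: the coordinates are invented, and a k-nearest-neighbours vote is used only because it makes the effect easy to see (k=1 sticks to every single training point, k=3 smooths things out).

```python
import math
from collections import Counter


def knn_predict(training, point, k):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def training_accuracy(training, k):
    """Share of training points that the k-NN vote labels correctly."""
    hits = sum(knn_predict(training, xy, k) == label for xy, label in training)
    return hits / len(training)


# Invented points: a blue group, a red group, and one isolated blue point
# sitting in the middle of the red group.
training = [
    ((0.0, 0.0), "blue"), ((1.0, 0.0), "blue"), ((0.0, 1.0), "blue"),
    ((4.0, 4.0), "red"), ((5.0, 6.0), "red"), ((6.0, 4.0), "red"),
    ((5.0, 4.6), "blue"),
]
new_point = (5.0, 4.8)  # surrounded by the three red points

# k=1 reproduces the learning data perfectly but follows the isolated point.
print(training_accuracy(training, 1), knn_predict(training, new_point, 1))  # 1.0 blue

# k=3 makes one mistake on the learning data but answers "red" as expected.
print(knn_predict(training, new_point, 3))  # red
```

The model with a perfect score on the learning data is the one that gets the new point wrong, which is exactly what a high variance looks like.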
One of Machine Learning’s biggest goals is to train a model that generalizes to new data. If your model cannot predict correctly on new data, then your training is useless. As you’ve seen above, too high a variance prevents your model from generalizing correctly to new data. And, quite obviously, too high a bias prevents the model from learning from the data at all. You therefore need a way to estimate both before trusting a model.
K-fold cross-validation is one way to keep an eye on both bias and variance: the original learning data is randomly partitioned into K folds of equal size. At each step, one fold is held out to test the performance of the model, and the other (K-1) folds are used for training. This step is repeated K times, so that each fold is used exactly once for testing. If your model doesn’t suffer from high variance (a.k.a. overfitting in the ML community), you should get homogeneous performance across the K cases. If your model performs well (low bias), you should also get a high average performance across the K cases.
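The procedure above can be sketched in plain Python as follows. The dataset, the helper names and the tiny “nearest centroid” model are assumptions made up for the example. Any model with a train function and a predict function could be plugged in instead.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, k, train_fn, predict_fn, seed=0):
    """Return one accuracy score per fold (each fold is tested exactly once)."""
    scores = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train_x = [samples[i] for i in range(len(samples)) if i not in held_out]
        train_y = [labels[i] for i in range(len(samples)) if i not in held_out]
        model = train_fn(train_x, train_y)
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Toy model (an assumption for this sketch): remember the mean value of each class.
def train_centroids(xs, ys):
    by_label = {}
    for x, y in zip(xs, ys):
        by_label.setdefault(y, []).append(x)
    return {label: statistics.mean(values) for label, values in by_label.items()}


def predict_centroid(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


# Invented, well-separated one-dimensional data.
samples = [1.0, 1.5, 2.0, 2.5, 3.0, 9.0, 9.5, 10.0, 10.5, 11.0]
labels = ["blue"] * 5 + ["red"] * 5

scores = cross_validate(samples, labels, 5, train_centroids, predict_centroid)
print(statistics.mean(scores))    # high average: no sign of high bias
print(statistics.pstdev(scores))  # small spread: no sign of high variance
```

On this easy dataset every fold scores 1.0, so the average is high and the spread is zero. On real data, a large spread between folds is the warning sign to look for.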