Lately, you may have heard about something called machine learning being used in all sorts of applications. But what is machine learning exactly, and how does it help make predictions and decisions based on existing data? That is what I am going to explain in this blog!
Machine learning is a form of artificial intelligence in which algorithms are used to train machine learning models on sample data, also known as training data. There are many different kinds of algorithms, and they can broadly be used for predictive or decisive use cases.
An example of a predictive machine learning problem is determining whether someone can buy a car based on their income and age. This is a classification model, since you are determining which of two classes the person belongs to: they either can or cannot buy a car.
On the other side, you have the decisive machine learning algorithms, which help you make a decision. For example, as an employer you have a new candidate and you don't know what salary to offer them. You can use the salaries of your current employees, together with their age and experience, to get a suggested salary for the new candidate based on their age and experience. An algorithm like that is called a regression model.
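The salary example above can be sketched with a linear regression model from sklearn. The numbers here are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [age, years of experience] per employee.
X = np.array([[25, 2], [30, 5], [35, 8], [40, 12], [45, 15], [50, 20]])
y = np.array([40000, 50000, 62000, 75000, 82000, 95000])  # their salaries

# Fit a regression model on the existing employees.
model = LinearRegression()
model.fit(X, y)

# Suggest a salary for a 33-year-old candidate with 6 years of experience.
suggestion = model.predict([[33, 6]])
```

The model learns the relationship between age, experience and salary, and then suggests a salary for the new candidate that fits that pattern.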
Every machine learning model can be classified into one of three categories: supervised, unsupervised and reinforcement learning. I am not going to explain the reinforcement learning algorithms in this blog, since this blog will be the first of a series about machine learning and I will only cover the basic machine learning models here.
Supervised machine learning is defined by its use of labelled datasets to predict or classify data. These labelled datasets are fed into the model, and the model recognizes patterns in them using mathematical calculations. These models can be used for real-world applications like filtering spam mails into a separate mailbox.
The opposite of supervised machine learning is unsupervised machine learning: the algorithm searches for patterns in an unlabelled dataset without the need for human intervention. These kinds of machine learning models can be used for a wide variety of applications, like exploratory data analysis and cross-selling strategies.
Before we can start creating a machine learning model, we have to go through a preprocessing phase. During this phase, we transform the dataset so that it fits the model best.
Every dataset you are going to use for machine learning must consist of at least one independent variable and one dependent variable. Independent variables are the variables you know beforehand, e.g. experience and age. Dependent variables are the variables you are going to predict using the machine learning algorithm, e.g. the salary based on experience and age.
After you have determined the dependent and independent variables, you can save them in arrays to use later on.
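As a sketch, assuming a dataset with Age and Experience as independent variables and Salary as the dependent variable (in practice you would load it with pandas' read_csv, the values below are made up):

```python
import pandas as pd

# Hypothetical dataset; normally this comes from pd.read_csv(...).
dataset = pd.DataFrame({
    "Age": [25, 30, 35],
    "Experience": [2, 5, 8],
    "Salary": [40000, 50000, 62000],
})

# Independent variables: every column except the last one.
X = dataset.iloc[:, :-1].values
# Dependent variable: the last column.
y = dataset.iloc[:, -1].values
```

X and y are now plain NumPy arrays, which is the form the sklearn tools in the rest of this blog expect.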
It happens: you get assigned to create a machine learning model, but the dataset you received has some missing data. There are multiple ways to make sure your algorithm will not break upon encountering one of these records. The first one is the simplest: just remove the rows with missing data. This works well and won't affect the algorithm too much, unless your dataset is relatively small. The definition of a small dataset differs per application, since a dataset used for image recognition needs to be much bigger than a dataset used for binary classification.
The second way is to fill in missing data using the SimpleImputer class from the sklearn Python library. This class accepts an argument called strategy, which expects one of the following values: mean, median, most_frequent or constant.
- mean: replaces missing values with the average of the other values in the column. This can only be used for numeric values.
- median: replaces missing values with the middle value when all the values are ordered. This can only be used for numeric values.
- most_frequent: replaces missing values with the most frequent value in the column. This can be used for numeric and character values.
- constant: replaces missing values with a predetermined value. This can be used for numeric and character values.

These are the most used strategies to take care of missing data without hurting the reliability of your dataset.
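A minimal sketch of the mean strategy, using a small made-up age/salary array with one missing value per column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and salary, with a missing value (np.nan) in each column.
X = np.array([[25, 40000], [np.nan, 50000], [35, np.nan], [40, 75000]])

# Replace each missing value with the mean of the other values in its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

After fit_transform, the missing age becomes the mean of the known ages, and the missing salary the mean of the known salaries; no row had to be thrown away.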
A computer only thinks in 0's and 1's and doesn't know what the values of a string mean. So we have to encode all categorical data to make sure the algorithm sees the data as integers, which makes more sense to an algorithm.
To do this you need two modules from sklearn. The first one is ColumnTransformer, which lets you apply a transformation to the categorical column. The second one is OneHotEncoder, which encodes the categorical data. It checks how many categories exist within the column, creates the same number of columns as there are categories, and sets one of the columns to one according to the category the row belongs to.
For example, take a column with three categories: France, Spain and Germany. After encoding, the categorical data has three columns: if the first column contains a one the row corresponds to France, the second column corresponds to Spain and the last column to Germany.
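The country example can be sketched like this (the rows are made up; note that OneHotEncoder orders the new columns alphabetically, so here they correspond to France, Germany and Spain):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A categorical country column plus a numeric age column.
X = np.array([["France", 44], ["Spain", 27], ["Germany", 30], ["France", 38]],
             dtype=object)

# One-hot encode column 0 and pass the remaining columns through untouched.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = ct.fit_transform(X)
```

The single country column is replaced by three 0/1 columns, one per category, and the age column is kept as-is at the end.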
The next step in the data pre-processing phase is scaling the features relative to each other. Data can vary a lot between columns. For example, you can have an age column and a salary column, where the age column ranges from 18 to 80 and the salary from 20000 to 200000. A person sees that the columns stand for different things, but a machine learning algorithm will think the salary column is more important, since it has much higher values. That is why we need to apply feature scaling, to make sure the values of age and salary are in the same range. Usually, this range is from -3 to 3.
To apply feature scaling, you can use the StandardScaler class from sklearn.
This step is not always needed. When the algorithm uses the Euclidean distance (the distance between data points) to predict the next data point, you should apply feature scaling, since the distance between points matters. For example, the k-nearest neighbours classifier checks which existing points the to-be-predicted data point is closest to; if you do not apply feature scaling, the output of the model will be completely different and less accurate.
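A minimal sketch of feature scaling with sklearn's StandardScaler, using the age and salary ranges from the example above (the rows are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age (18-80) and salary (20000-200000) live on very different scales.
X = np.array([[18, 20000], [35, 60000], [52, 120000], [80, 200000]], dtype=float)

# Standardize each column: subtract its mean, divide by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```

After scaling, each column has mean 0 and standard deviation 1, so age and salary end up in a comparable range and neither dominates a distance calculation.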
The final part of data pre-processing is splitting the dataset into a set used for training and a set used for testing/validation. To do this, we need the train_test_split function from sklearn. This function accepts an argument called test_size, which determines what percentage of the original dataset goes into the test set; the rest goes into the training set. Usually, this is set to 0.2 or 0.3, so there is enough data for the algorithm to train on and enough data to evaluate the results.
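A sketch of the split on a small made-up dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # ten rows of features
y = np.arange(10)                 # ten labels

# 20% of the rows go to the test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With ten rows and test_size=0.2, eight rows end up in the training set and two in the test set, and matching rows of X and y always stay together.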
You now know the basics of machine learning: the difference between supervised and unsupervised algorithms, and how to pre-process a dataset. This sets you up to start developing a machine learning model in the next blog.