Most machine learning algorithms can only process numerical values. Since many datasets contain categorical variables, a machine learning engineer needs to be able to convert these categorical values into numerical ones using the right approach. That means knowing which tools are out there and how and when to use them. After reading this post, you will be able to do all that.
As I already mentioned, most datasets contain categorical variables, like colors (blue, green, yellow) or sizes (S, M, L, XL), that can’t be used by many machine learning algorithms. As with many other aspects of machine learning, there is no single best solution: each approach has its advantages and disadvantages, which impact the overall quality of your prediction. Fortunately, pandas and scikit-learn provide several tools that can transform categorical data.
Table of Contents:
1. Converting variables by yourself
2. Label Encoding
3. One Hot Encoding (dummy variables)
1. Converting variables by yourself
Oftentimes there are features that contain words which represent numbers. With pandas it is very straightforward to convert these text values into their numeric equivalents using the “replace()” function.
If you go through the documentation of the “replace()” function, you will see that there are many different options for replacing values. In our example we just need to create a mapping dictionary that contains each column as well as the values that should replace the current ones.
We will go through the exact implementation below, using the Kaggle “Titanic: Machine Learning from Disaster” dataset. In the picture below, you can see the “Pclass” feature, which contains values that can easily be transformed into numeric ones.
With the “value_counts()” method, we can easily see how many different categories the “Pclass” feature has and how many entries belong to each.
Now we know that we only need to convert three categories. To replace these strings with their equivalent numbers, we create the mapping dictionary and pass it to the “replace()” function, which we call on the dataset.
Now we have successfully converted the categorical variables into their corresponding numbers. Below you can see the outcome of our transformation.
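The steps above can be sketched as follows. Since the original data frame is only shown as a picture, the sample values and the exact mapping here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic "Pclass" column, assuming it
# stores the passenger class as text
df = pd.DataFrame({"Pclass": ["Third", "First", "Third", "Second", "Third"]})

# Inspect the categories and how many entries belong to each
print(df["Pclass"].value_counts())

# Mapping dictionary: {column name: {old value: new value}}
mapping = {"Pclass": {"First": 1, "Second": 2, "Third": 3}}
df = df.replace(mapping)

print(df["Pclass"].tolist())  # [3, 1, 3, 2, 3]
```

Because the outer key of the dictionary is the column name, `replace()` only touches that column and leaves the rest of the data frame unchanged.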
2. Label Encoding
Converting categorical variables can also be done with Label Encoding, which simply converts each value in a column into a number. We will use Label Encoding to convert the “Embarked” feature in our dataset, which contains three different values.
First, we need a little trick to get label encoding working with pandas: we convert the “Embarked” feature into a categorical one, so that we can then use its category codes for the label encoding:
Now we can do the label encoding with the “cat.codes” accessor:
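A minimal sketch of both steps, using a small hypothetical sample of the “Embarked” column:

```python
import pandas as pd

# Hypothetical sample of the "Embarked" feature (ports S, C, Q)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# Step 1: cast the column to the pandas "category" dtype
df["Embarked"] = df["Embarked"].astype("category")

# Step 2: label-encode via the cat.codes accessor; codes follow the
# sorted category order, so C=0, Q=1, S=2
df["Embarked"] = df["Embarked"].cat.codes

print(df["Embarked"].tolist())  # [2, 0, 1, 2, 0]
```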
Below you can see how the “Embarked” feature looks now. We successfully changed its values into numerical ones: C is now equal to 0, Q is equal to 1, and S is equal to 2.
The problem with Label Encoding is that its numeric values can be “misinterpreted” by the algorithms, which we will discuss later in this post.
3. One Hot Encoding (dummy variables)
One Hot Encoding is one of the most widely used methods to represent categorical variables, because its transformed numerical values can’t be “misinterpreted” as easily as with Label Encoding. I will explain how One Hot Encoding works with a simple example.
In the data frame below you can see seven sample inputs that belong to four categories. We could again encode these into numeric values with label encoding, but that makes no sense from a machine learning perspective: if we did, the machine learning model would think that the category “Alien” is greater or smaller than “Penguin”. This would be a “misinterpretation” of the data.
What we need to do instead is create one boolean column for each category, where only one column can have the value 1 for each sample. In other words, the basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to it. You can see this in the picture below, and it is why the technique is called one hot encoding.
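To make the idea concrete, here is the strategy built by hand, before reaching for any library helper. The animal names are assumptions standing in for the pictured example:

```python
import pandas as pd

# Hypothetical stand-in for the example: seven samples, four categories
df = pd.DataFrame({"Animal": ["Penguin", "Alien", "Dog", "Cat",
                              "Penguin", "Dog", "Alien"]})

# One boolean column per category, built manually
categories = sorted(df["Animal"].unique())
for category in categories:
    df[category] = (df["Animal"] == category).astype(int)

# Every row has exactly one 1 across the new columns
assert (df[categories].sum(axis=1) == 1).all()
print(df)
```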
This works very well with most machine learning algorithms and is nowadays easy to implement, thanks to our machine learning libraries that can take care of it.
Pandas supports this process with the “get_dummies” function, which creates dummy variables (1 or 0).
We look again at the “Embarked” feature, which has the values “S”, “C” or “Q”.
By using the “get_dummies” function, we convert these values into three different columns with a 1 or 0 corresponding to the correct value.
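A short sketch of this step, again on an assumed sample of the “Embarked” column:

```python
import pandas as pd

# Hypothetical sample of the "Embarked" feature
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One dummy column per port; the prefix keeps the origin visible.
# astype(int) turns the boolean dummies into 1/0 values.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked").astype(int)
df = pd.concat([df.drop(columns="Embarked"), dummies], axis=1)

print(df.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(df["Embarked_S"].tolist())  # [1, 0, 0, 1]
```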