Machines have a hard time understanding data humans can read. Apply encoding on qualitative data to train models more efficiently.
Before we begin
Most applications will have collected big data in their lifetime of usage. This data is extremely useful to the business for making improvements, but it is also applicable to developing AI solutions. Machine learning is an AI solution that works best when a company has enough data to detect patterns and behaviors, from ranking the present to predicting the future. In this guide, we’ll look at how to clean big data so machines can interpret it faster and more accurately while it remains human interpretable.
That’s a lot of data
To start off with our big data, we’ll choose data points that are text rather than numbers, because machines do not understand text as well as they understand numbers. The following columns, “Education” and “Marital_Status”, hold qualitative textual data. Qualitative data, sometimes also called categorical data, has two forms: nominal and ordinal.
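As a rough sketch of how the text columns might be picked out with Pandas, the snippet below keeps every column that is not numeric. The sample rows and the `Income` column are made up for illustration; only the names `Education` and `Marital_Status` come from the dataset described here.

```python
import pandas as pd

# Made-up sample rows for illustration; "Income" is a hypothetical
# numeric column added only to contrast with the text columns.
df = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Basic"],
    "Marital_Status": ["Married", "Divorced", "Widow"],
    "Income": [50000.0, 42000.0, 61000.0],
})

# Excluding numeric dtypes leaves the qualitative (text) candidates.
qualitative_columns = df.select_dtypes(exclude="number").columns.tolist()
print(qualitative_columns)
```

On a real dataset it is worth checking the result by eye, since numeric codes stored as text, or categories stored as numbers, would be sorted into the wrong group.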
Source: Towards Data Science
One form of qualitative data is ordinal data. This type of data requires more thought because its values can be ranked against each other. The ranking gives the machine insight into how each value relates proportionally to the others. For instance, in the data above, the “Education” column is ordinal because the amount of investment varies: a high school diploma (“Basic”), an associate’s degree (“2n Cycle”), a bachelor’s degree (“Graduation”), a master’s degree (“Master”), and a PhD each reflect a different level of expertise.
A simpler form of qualitative data is nominal data. Its values are only names, with no inherent order, so it doesn’t take many changes to turn it into something a machine can interpret easily. To put it simply, all that is required is a mapping from each non-numerical value to a numeric one. The marital status column is an example of nominal data, because being married, divorced, or widowed doesn’t make one person better than another, so the values cannot be weighed against each other. In fact, it’s dangerous to use weights here, as they can make models discriminatory and biased.
Now that we understand the differences between qualitative data types, we can look at how to encode them so that the encoding keeps the scale where it matters and removes it where it doesn’t. There are powerful encoding methods supported by scikit-learn, such as label encoding, but we won’t be using them. Instead, we’ll use only Pandas to truly understand the finer steps of what happens behind those functions.
For ordinal data, we’ll begin by assigning labels based on priority or value. Our dataset contains several levels of education. We start by creating a map that ranks each level relative to the others on a 10-point scale.
The pinnacle of education, the PhD, ranks as a 10. On the other hand, the lowest value in the dataset ranks as a 1. Traditionally, students who continue into higher education complete up to a bachelor’s degree before seeking a job, so we weigh it as a 5. The 2n Cycle, or associate’s degree, sits 2 points below that at 3, and the master’s degree sits between the bachelor’s and the PhD at 8. As a result, our final weights are <1, 3, 5, 8, 10> for <’Basic’, ‘2n Cycle’, ‘Graduation’, ‘Master’, ‘PhD’>.
We store this as a map
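A minimal sketch of that map, using the weights listed above and applied with plain Pandas. The sample rows and the column name `Education_Rank` are assumptions made for illustration.

```python
import pandas as pd

# Weights from the text: <1, 3, 5, 8, 10>.
education_map = {"Basic": 1, "2n Cycle": 3, "Graduation": 5, "Master": 8, "PhD": 10}

# Made-up sample rows for illustration.
df = pd.DataFrame({"Education": ["Graduation", "PhD", "Basic", "2n Cycle", "Master"]})

# Series.map looks up each value in the dict; labels missing from
# the dict would become NaN instead of raising an error.
df["Education_Rank"] = df["Education"].map(education_map)

# Guard against labels that the map does not cover.
assert df["Education_Rank"].notna().all(), "unmapped education level"
print(df)
```

Writing to a new column rather than overwriting `Education` keeps the original labels next to their ranks, which helps the data stay human interpretable.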
These values give the machine insight into how each level relates proportionally to the others. The “Education” column is ordinal because the amount of investment varies: a high school diploma (“Basic”), an associate’s degree (“2n Cycle”), a bachelor’s degree (“Graduation”), a master’s degree (“Master”), and a PhD each reflect a different level of expertise.
Result of the new map
One Hot Encoding
Nominal data, on the other hand, should be weighted equally across all values, so we cannot assign them unequal numbers. To map each non-numerical value, we create a new column for each unique value and record whether that value is present or not.
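The idea can be sketched by hand in Pandas, one new column per unique value. The sample rows are made up for illustration, and the `Marital_Status_<value>` naming is an assumption.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({"Marital_Status": ["Married", "Divorced", "Widow", "Married"]})

# One new column per unique value: 1 where the row holds that value, 0 otherwise.
for status in df["Marital_Status"].unique():
    df[f"Marital_Status_{status}"] = (df["Marital_Status"] == status).astype(int)

print(df)
```

Each row ends up with exactly one 1 across the new columns, so no status is ranked above another.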
Looks like this dataset has some funny answers
Each unique value in “Marital_Status” requires a new column. One shortcut in Pandas is to use a prefix to name the new column created for each value in the column.
Every present value has been set to 1, and every absent value to 0.
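One Pandas function that fits this description is `pd.get_dummies`, which accepts a `prefix` argument. Whether the original walkthrough used it is an assumption, and the sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({"Marital_Status": ["Married", "Divorced", "Widow", "Married"]})

# columns= selects what to encode and drops the original column;
# prefix= names each new column as "<prefix>_<value>";
# dtype=int requests 1/0 output, since recent Pandas versions
# return True/False by default.
encoded = pd.get_dummies(df, columns=["Marital_Status"], prefix="Marital_Status", dtype=int)

print(encoded)
```

Values that appear only a handful of times each still get their own column, so it can be worth reviewing the unique values in a column before encoding it.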
We now understand the differences between qualitative data types and how to encode our big data into numbers a machine can understand. We also learned best practices for avoiding bias when converting textual data. In the next section, we’ll take a look at quantitative numerical data and explore the differences between continuous and discrete data, and the approaches to each, through scaling.