As a professional in data analytics, I spend a significant amount of time on data preprocessing before I can get to running exciting regressions or creating interactive dashboards. Data preprocessing can be tedious, but it is a crucial task that requires many decisions based on human judgement.
In this process I encounter judgement calls such as: Which values of the numeric variables are plausible? Is it likely that the age variable is negative? Is it likely that the age of a company founder is below 18? Is the category called ‘personal loan’ the same as ‘loan for personal purpose’? Can a categorical variable with 100 unique values be grouped into fewer categories?
Over time, I have set up a five-stage system for myself that I follow with every dataset. This system helps me keep my work process efficient and organized. Following the outlined steps, I try to preserve the quality of the data by looking at outliers and missing observations, evaluating these irregularities and correcting them based on human logic.
Here I would like to share this system in the form of a cheat sheet. The PDF version with the code outlines is also available for download at the bottom of the page. For data preprocessing I use the Pandas module in Python. If you have never worked with this module before, I suggest that you get familiar with my previous post.
The five stages are as follows:
- Get familiar with the dataset
- Adjust the raw data to your needs
- Deep dive in numeric variables
- Deep dive in categorical variables
- Final corrections
STEP 0. Import libraries.
import pandas as pd
import numpy as np
import os
Set the working directory and import a CSV file. This step assumes that the raw data file is saved in the same directory as the script file.
path = os.getcwd()
df = pd.read_csv('filename.csv')
STEP 1. Get a quick overview of the data: (1) print the top 5 rows, (2) print the list of all column names, (3) print the dimensions of the dataset, i.e., rows × columns, (4) print the data types of the variables.
df.head()
df.columns
df.shape
df.info()
STEP 2. Select only the columns of interest. This step assumes that you have the variable description for your dataset and you know which columns you would need for the data analysis. If you are not sure, it is better to create a new dataframe with the selected columns to preserve the raw imported data in case you need to get back to it later.
df_filtered = df[['col1', 'col2', 'col3']]
Rename the columns with a dictionary. Here, you specify the dictionary keys as the old names and the corresponding values as the new names of your choice.
my_dict = {'col1': 'my_var1', 'col2': 'my_var2', 'col3': 'my_var3'}
df.rename(columns=my_dict, inplace=True)
STEP 3. Get the summary statistics of the numeric variables. Additionally, you can check the distribution of a specific variable by plotting its histogram. This brief overview can help with decisions regarding data normalization.
df.describe()
df['col1'].hist()
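As an illustration of what such a decision could look like, below is a minimal sketch of one common normalization approach, min-max scaling. The new column name 'col1_scaled' is only a hypothetical example, not a fixed part of this workflow.
# min-max scaling of col1 to the [0, 1] range ('col1_scaled' is a hypothetical name)
df['col1_scaled'] = (df['col1'] - df['col1'].min()) / (df['col1'].max() - df['col1'].min())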
In case you can already spot outliers (e.g., a negative number in the age variable) or empty values, you can print the rows that contain these values.
# prints rows with negative values for col1
df[df['col1'] < 0]
# prints rows with empty values for col2
df[df['col2'].isna()]
# prints rows with combined condition
df[(df['col1'] < 0) & (df['col2'].isna())]
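Once such rows are identified, one possible correction is to treat the implausible values as missing, so that they are handled together with the other empty values at the end. A minimal sketch, assuming that negative values in 'col1' are considered invalid:
# replace negative values in col1 with NaN (one possible correction, based on your own logic)
df.loc[df['col1'] < 0, 'col1'] = np.nan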
STEP 4. Per categorical variable, you can print (1) the number of categories; (2) the names of unique categories; (3) the count of rows per category.
df['var1'].nunique()
df['var1'].unique()
df.groupby(['var1']).size()
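In addition, it can sometimes be easier to judge the categories by their share rather than by raw counts; a small optional sketch:
# relative frequency of each category in var1
df['var1'].value_counts(normalize=True)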
In case you can already spot suspicious data entries, you can filter the rows based on the desired condition. Note that categorical variables typically have a string (object) or boolean data type.
df[df['var1'] == 'strange value']
df[df['var2'] == False]
# prints rows with combined condition
df[(df['var1'] == 'strange value') & (df['var2'] == False)]
There can be cases where you spot that several categories can be united into one. For instance, in the ‘country’ variable, the category ‘NL’ likely means the same as ‘The Netherlands’. You can decide to rename these categories.
df.loc[df['country'] == 'The Netherlands', 'country'] = 'NL'
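If a categorical variable has too many unique values (like the 100 categories mentioned earlier), one possible way to group them into fewer values is to keep the most frequent categories and collect the rest into a single ‘other’ category. A minimal sketch, assuming 'var1' is such a variable and the cutoff of 5 categories and the label 'other' are only illustrative choices:
# keep the 5 most frequent categories of var1 and group the rest as 'other' (hypothetical grouping)
top_categories = df['var1'].value_counts().nlargest(5).index
df['var1'] = df['var1'].where(df['var1'].isin(top_categories), 'other')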
STEP 5. For specific purposes such as merging datasets, you might need to insert a column with the same value for all rows. You specify the index of the new column as the first parameter, and the name and the value of the new column as the second and third parameters, respectively.
df.insert(2, 'new_col_name', value)
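As a usage example, such a column can then serve as a key when combining two datasets. A minimal sketch, assuming a second dataframe df_other (a hypothetical name) that also contains 'new_col_name':
# merge two dataframes on the inserted column (df_other is a hypothetical second dataframe)
df_merged = df.merge(df_other, on='new_col_name', how='left')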
After the thorough data preprocessing, it is good to check whether any missing data is left. You can print the count of missing values per column and then decide whether you wish to drop the missing observations.
df.isnull().sum()
df_clean = df.dropna()
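Instead of dropping the rows, another option is to fill the missing values, for example with a constant for a categorical column or with the median for a numeric column. A minimal sketch, where the fill values are only illustrative:
# fill missing values: a constant label for var1, the column median for col1 (illustrative choices)
df['var1'] = df['var1'].fillna('unknown')
df['col1'] = df['col1'].fillna(df['col1'].median())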
The cleaned data can be saved as a separate file. In this step, the clean data file is saved in the same directory as the script file.
df_clean.to_csv('filename_clean.csv', index=False)
These are the five stages of data preprocessing that I use to organize my work. In the next post, I will show how I implement these steps on a dataset. I hope you found value in this tutorial! You can follow me for more data analytics related content.