Getting started with Pandas

Pandas is by far my favourite module in Python as this was the starting point for my programming journey. When I started learning Python, I was already well-familiar with Microsoft Excel. I started translating the Excel commands to Python script using the Pandas module and eventually moved my data analytics activities there. Here, I want to share the tips to make this transition smooth for someone looking to get started with data analytics in Python.

I will assume that the reader is familiar with the basics of Python.

In this tutorial, I will demonstrate 5 starting steps, i.e., (1) loading the data, (2) exploring the dataset, (3) creating new columns, (4) applying filters, and (5) saving the new dataset. I will use the ‘billionaires’ dataset which can be downloaded here.

We are getting started by importing the Pandas module.

import pandas as pd

To load the data, we need to define its location and then apply the reader function from the module. Here, we parse the result of loading the data into a variable df (which stands for dataframe).

df = pd.read_csv(‘C:\Users\your_name\your_folder\billionaires.csv’)

Next, it is a good practice to explore the dataset before working with it. As opposed to Excel, in Python, you usually have to call separate commands to see the loaded data or its dimensions. Here are some standard commands to explore your loaded data:

df.head()
Figure 1: The output of head() method
df.columns
Figure 2: The output of columns attribute

The first command, df.head() prints the first five rows of the dataset. If there are many columns in the dataset, not all columns will be visible after calling head() method. To see all the columns that are in the given dataset, you can call the columns attribute to print the list with all columns.

df.shape
Figure 3: The output of shape attribute

The attribute shape prints the dimensions (i.e., number of rows and columns) of the dataset.

df.info()
Figure 4: The output of info() method
df.describe()
Figure 5: The output of describe() method

Next, it is important to understand the data types of the loaded variables. Method info() is handy for this operation. You can see that some variables are integers, others are objects (in this case, strings) or boolean variables (i.e., variables which take either of True or False values).

Finally, to get the summary statistics of numeric variables, describe() method is called.

Now, you might be interested in creating new columns. This step is the most straightforward for Excel users. In Excel, you would create a new column in an empty cell by manipulating the two existent columns. The same logic applies in pandas: we create the new column by mentioning which of the existent columns we manipulate with. In this case, you can get the new column ‘birth year’ by subtracting the age column from the year column. Note that you need to specify that you are taking the ‘year’ column from the df data source, hence, the notation is df[‘year’].

df['birth year'] = df['year'] - df['demographics.age']

The dataset filtering in pandas is less straightforward compared to Excel, where you can apply the filters and select values by ticking the desired selection. However, once mastered, coding conditions in pandas offers more opportunities and quicker command execution. Here, I want to show how to write conditions for two data types, i.e., string and integer. Suppose, we want to filter the rows where the sector is real estate. In this case, we specify the condition as follows:

(df['company.sector'] == 'real estate')

Note that the double equality sign means ‘is equal to’ whereas the single equality sign is used to assign values to variables.

We can also specify a second condition with an integer. Suppose we want to filter the companies which were founded before 2000, we can specify the condition as follows:

(df['company.founded'] < 2000)

It is also possible to combine these conditions by calling the following command:

df[(df['company.sector'] == 'real estate') & (df['company.founded'] < 2000)]

This command will print the filtered data frame. If you would like to save these results, you should parse these conditions to the new dataframe and work with it separately. This can be done in the following way:

df_filtered = df[(df['company.sector'] == 'real estate') & (df['company.founded'] < 2000)]
df_filtered.shape
Figure 6: The dimensions of the filtered output

You can see that there are 162 real estate companies founded before 2000.

Lastly, let us also discuss what to do with these manipulations. As you may understand, the code that we wrote in the editor did not influence the raw data file. However, if you want to save the result of our manipulations, you can use the following command:

df_filtered.to_csv('re_billionaires.csv', index=False)

It is a good practice not to overwrite the raw data, so I suggest saving it under a different name. By default, the file will be saved in the same directory as your script file.

This was a brief summary of getting started with the Pandas library. I hope you found value in this tutorial! You can follow me for more data analytics related content.

1 Comment

Leave a Reply

Your email address will not be published. Required fields are marked *