Data Visualization: Building A Bridge Between Data Analytics And Storytelling

When it comes to communicating the insights of data analysis, data visualization plays a crucial role. A clear and visually appealing graph builds trust between data analysts and their readers or listeners.

So far, I have mostly worked with exploratory data visualization, i.e., I used graphs to get quick insights from the data for myself. When working with regressions, I would draw a scatter plot to get an idea of how the variables correlate. Histograms also came in handy for checking the distribution of the variables.

These general plotting functions are directly accessible in pandas via the Series.hist() and DataFrame.plot.scatter() methods. However, it is fairly simple to take data visualization to the next level with the Seaborn library.
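As a quick illustration of the two pandas methods mentioned above, here is a minimal sketch. The data is randomly generated and the column names volume and interest are my own placeholders, not the Mintos data.

```python
import numpy as np
import pandas as pd

# Made-up toy data, only to have something to plot.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "volume": rng.normal(10, 2, 200),
    "interest": rng.normal(12, 2, 200),
})

# Histogram of a single variable to check its distribution.
hist_ax = toy["interest"].hist(bins=20)

# Scatter plot of two variables to eyeball their correlation.
scatter_ax = toy.plot.scatter(x="volume", y="interest")

# In an interactive session, matplotlib.pyplot.show() displays both figures.
```

Both calls return Matplotlib Axes objects, so labels and titles can still be adjusted afterwards.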

In this tutorial, I would like to demonstrate how I would switch from exploratory to explanatory data analysis. I will use the data on P2P lending that I have access to as an investor on Mintos, the worldwide P2P marketplace.

If you would like to get the raw data, please refer to the Mintos webpage. It is possible to register on Mintos free of charge without an initial deposit. Furthermore, you can find more insights about P2P lending in my previous post.

Assume we have the following variables:

  • Average loan term aggregated per unit per time interval;
  • Average interest rate aggregated per unit per time interval;
  • Average loan volume on a logarithmic scale (i.e., number of loans) per unit per time interval,

where a unit is a world region and the time interval is monthly.
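If you start from loan-level records, the aggregation described above could be sketched roughly as follows. The raw column names (issue_date, interest_rate, term_months) and the six example loans are hypothetical; the output columns are named to match the plotting code in this tutorial, and I assume that count holds the natural logarithm of the number of loans.

```python
import numpy as np
import pandas as pd

# Hypothetical loan-level records; names and values are made up for illustration.
loans = pd.DataFrame({
    "issue_date": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-01-11",
        "2021-02-03", "2021-02-17", "2021-02-25",
    ]),
    "World Region": ["Africa", "Africa", "Asia", "Africa", "Asia", "Asia"],
    "interest_rate": [12.0, 14.0, 10.0, 13.0, 9.0, 11.0],
    "term_months": [6, 12, 8, 10, 4, 6],
})

# The time interval is monthly, so derive a month key from the issue date.
loans["month"] = loans["issue_date"].dt.to_period("M")

# One row per region and month: number of loans, mean rate, mean term.
df = (
    loans.groupby(["World Region", "month"])
    .agg(
        n_loans=("interest_rate", "size"),
        avg_interest=("interest_rate", "mean"),
        avg_duration=("term_months", "mean"),
    )
    .reset_index()
    .rename(columns={"avg_duration": "Average loan duration"})
)

# Loan volume on a logarithmic scale (natural log is an assumption).
df["count"] = np.log(df["n_loans"])
```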

First, it is a good practice to define a question that we would like to answer with our graph. Given the nature of the variables, we can formulate the following research question:

What is the correlation between loan volume and loan interest rate? How does it vary per region?

We can also straightaway formulate a hypothesis. Let us assume that in regions where P2P lending is not popular, i.e., in the regions with low loan volume, the average interest rate would be higher. The idea here is that P2P is considered too risky due to markets’ unfamiliarity, and borrowers are therefore required to pay higher interest.

Now, we can construct the graph with a few lines of code using the Seaborn and Matplotlib libraries.

import seaborn as sns 
import matplotlib.pyplot as plt

First, we set the graph background theme. Next, we pass the data to the sns.relplot() function. The data parameter indicates the dataframe that we are working with, and the x and y parameters are loan volume and interest rate, respectively. We can also specify the colour and the size of the data points: here, hue distinguishes the regions by colour, and size scales each data point by the loan duration.

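# Apply the default Seaborn theme to the figure background
sns.set_theme()
# Scatter plot: colour encodes the region, point size encodes the loan duration
sns.relplot(data=df, x='count', y='avg_interest',
            hue='World Region', palette='RdYlGn',
            size='Average loan duration', sizes=(1, 300))
# Replace the raw column names with readable axis labels
plt.xlabel('P2P loan volume')
plt.ylabel('Interest rate')
plt.show()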

The graph looks like this:

Figure 1: P2P lending volume against P2P interest rates worldwide

Alright, let us see what we can already observe. First, the interest rates vary between 6% and 19%, whereas loan volume ranges from just above 0 to 14 on the logarithmic scale, meaning from 3.4 to 25 million.

Second, there is no clear trend visible in the relationship between loan volume and interest rate. Across all loan volumes, interest rates mostly stay in the range of 10-14%. There are some outliers where high interest rates were observed at a high loan volume, and low interest rates at a low loan volume. This observation contradicts our initial hypothesis.
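To put a number on what the eye sees, one could compute the Pearson correlation between the two variables, overall and per region. The sketch below runs on randomly generated stand-in data with the same column names as the plotting code, so its output says nothing about the actual Mintos data.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data; replace with the real dataframe.
rng = np.random.default_rng(42)
regions = ["Africa", "Asia", "Western Europe"]
df = pd.DataFrame({
    "World Region": np.repeat(regions, 24),
    "count": rng.uniform(1, 14, 72),
    "avg_interest": rng.uniform(6, 19, 72),
})

# Correlation between loan volume and interest rate over all observations.
overall_r = df["count"].corr(df["avg_interest"])

# The same correlation computed separately for each region.
per_region_r = (
    df.groupby("World Region")[["count", "avg_interest"]]
    .apply(lambda g: g["count"].corr(g["avg_interest"]))
)

print(f"overall: {overall_r:.2f}")
print(per_region_r.round(2))
```

A coefficient close to zero would be in line with the absence of a visible trend, while clearly different signs across regions would suggest that the relationship varies by region.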

Finally, on a regional level, we can see that Africa and Central and Eastern Europe are dominating the market in terms of P2P lending adoption. Furthermore, it can be seen that the loans with the longest duration were issued in Central and Eastern Europe. The loans with the smallest average duration (below 10 months) are mostly observed for Western Europe and Asia.

As you can see, we were able to get a lot of insights at first glance. Of course, to draw reliable conclusions, we would have to estimate the relationship between these variables using a regression, where observations would be clustered per month and region. For now, I would say we have achieved the goal of making our visual analysis more appealing, and we have also gained some insights into the data distribution.
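For readers who want to try the regression step, here is a rough sketch using the statsmodels library, which is my own choice and not part of the original analysis. It fits an ordinary least squares model with cluster-robust standard errors. As a simplification, it clusters by region only, and the panel is simulated with made-up column names (region, month, log_volume) and a made-up slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated region-by-month panel; all values are made up.
rng = np.random.default_rng(1)
regions = ["Africa", "Asia", "Central and Eastern Europe", "Western Europe"]
months = pd.period_range("2021-01", periods=12, freq="M").astype(str)
panel = pd.DataFrame(
    [(r, m) for r in regions for m in months],
    columns=["region", "month"],
)
panel["log_volume"] = rng.uniform(1, 14, len(panel))
panel["avg_interest"] = (
    12 - 0.1 * panel["log_volume"] + rng.normal(0, 1.5, len(panel))
)

# Integer cluster labels, one per region (simplifying assumption).
cluster_ids = pd.factorize(panel["region"])[0]

# OLS with standard errors clustered on the region labels.
result = smf.ols("avg_interest ~ log_volume", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": cluster_ids}
)

print(result.params["log_volume"], result.bse["log_volume"])
```

With only a handful of regions, cluster-robust standard errors are known to be unreliable, so this should be read as a starting point rather than a finished specification.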

Thank you for making it to the end. I hope you found value in this article! You can follow me for more content on data analytics or contact me directly via LinkedIn.
