It is always an open discussion: which factors contributed to the tremendous success of the richest people? In one of my favourite books on behavioural science, “Outliers” by Malcolm Gladwell, the author argues that a combination of exceptional emotional intelligence, hard work, knowledge and the ability to seize the chances that destiny offers is what unites the most financially successful people.
However disputable this conclusion may be, the topic of financial success remains attractive for researchers in various areas. I was also inspired to run a little machine learning experiment after coming across the ‘billionaires’ dataset.
Here I want to demonstrate, in the form of a tutorial, the steps that I took to run a logistic regression classification algorithm. My research question was: ‘can the model predict whether a billionaire gained their wealth from founding a company (rather than inheriting or acquiring it)?’
A regression estimates how the dependent variable (Y) is influenced by the independent variables (X). In the case of logistic regression, the Y variable is binary and the model estimates the probability of Y being 1 given the values of the X variables.
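For reference, logistic regression models this probability with the logistic (sigmoid) function:

P(Y = 1 | X) = 1 / (1 + exp(-(b0 + b1*X1 + ... + bk*Xk)))

where the coefficients b0, ..., bk are estimated from the data.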
The steps that I took to answer the research question are as follows:
Data cleaning. In the first part, I reviewed the dataset to identify the variables of interest, cleaned the data in those columns and created some new ones. More information on the processes that I apply for data preprocessing can be found in my previous post. In this part, I used the Pandas library.
import pandas as pd
import os
import numpy as np
path = os.getcwd()
df = pd.read_csv('billionaires.csv')
After loading the dataset, it becomes visible that it is structured as a billionaire ranking by year. Hence, many data entries refer to the same unit (the same person) in different years in which they were ranked as a billionaire. For my research, I was only interested in unit characteristics, not in the time dimension of the rank; therefore, I dropped the duplicates and kept only the first data entry for each unit, i.e., the year when they were first ranked as a billionaire.
# drop the duplicates based on name
df = df.drop_duplicates(subset=['name'], ignore_index=True)
Next, I reviewed the variables of interest by applying the unique() method. The dataset was then cleaned of inconsistent and missing observations.
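For the review step, applying unique() to a few of the columns of interest could look like this (the exact columns inspected here are an illustrative choice and are not listed in the original post):

# inspect the unique values of some candidate columns
print(df['demographics.age'].unique())
print(df['location.region'].unique())
print(df['wealth.how.category'].unique())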
df = df[(df['demographics.age'] > 0)]
df = df[(df['company.founded'] != 0)]
df = df[(df['location.region'] != '0')]
df = df[(df['demographics.age'] != 0)]
df = df[df['wealth.how.category'] != '0']
After that, the variable ‘company age’ was created, which refers to the age of the company in the year its owner was ranked as a billionaire. Also, the values of the ‘gender’ variable were recoded as integers.
# age of company when the founder became billionaire
df['company age'] = df['year'] - df['company.founded']

# recode gender as integers: 1 for male, 0 for female
df.loc[df['demographics.gender'] == 'male', 'demographics.gender'] = 1
df.loc[df['demographics.gender'] == 'female', 'demographics.gender'] = 0
Next, I created the label variable, which equals 1 if the ‘wealth.type’ column indicates that the billionaire founded the company. To apply this filter, the unique category names were first renamed for consistency.
# rename the category for consistency
df.loc[df['wealth.type'] == 'self-made finance', 'wealth.type'] = 'founder finance'

# create the label: 1 if the billionaire founded the company, 0 otherwise
df['founder'] = np.where(df['wealth.type'].str.contains('founder'), 1, 0)
In the final part of the data cleaning process, I dropped multiple columns that were not relevant for the research and verified the remaining missing values to judge whether those observations could be dropped without jeopardizing the data quality.
cols_drop = ['name', 'company.name', 'location.country code', 'rank', 'year',
             'birth year', 'company.relationship', 'location.citizenship',
             'wealth.how.from emerging', 'wealth.how.was political',
             'wealth.how.inherited', 'wealth.how.was founder', 'company.founded',
             'company.type', 'wealth.type', 'company.sector',
             'wealth.how.industry', 'start age']
df = df.drop(columns=cols_drop)
df = df.dropna()
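For the verification of the remaining missing values mentioned above, a quick per-column count (run before the dropna() call) could look like the following; this check is a sketch and is not shown in the original post:

# count the remaining missing values per column
print(df.isna().sum())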
Next, the columns were renamed.
my_dict = {'demographics.age': 'age',
'demographics.gender': 'gender',
'location.gdp': 'GDP',
'location.region': 'region',
'wealth.how.category': 'sector',
'wealth.worth in billions': 'wealth'}
df.rename(columns=my_dict, inplace=True)
The final dataset contained 1673 observations with the following values:
- Age: the age of the billionaire when first ranked
- Gender: the gender of the billionaire
- Region: the region in which the billionaire holds a citizenship status
- GDP: the GDP of the country where the billionaire holds a citizenship status
- Wealth: the wealth (in billions) of the billionaire
- Sector: the sector where the billionaire’s company operates
- Company age: the age of the company when its owner was first ranked as a billionaire
- Founder: dummy variable equal to 1 if the billionaire founded the company
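As a quick sanity check on the cleaned dataset (not shown in the original post), the number of observations and the data types of the remaining columns can be printed:

# verify the shape and data types of the cleaned dataset
print(df.shape)
print(df.dtypes)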
Data encoding. Data encoding translates the features into a numerical representation that a machine learning model can work with. Here, the numerical variables were log-transformed and then normalized to the [0, 1] interval.
num_var = ['age', 'company age', 'GDP', 'wealth']

# log-transform the numerical variables (with a small offset to avoid log(0))
numeric_attr = df[num_var] + 1e-7
numeric_attr = numeric_attr.apply(np.log)

# min-max normalization to the [0, 1] interval
df_num_transformed = (numeric_attr - numeric_attr.min()) / (numeric_attr.max() - numeric_attr.min())
Categorical variables needed to be encoded as dummies. A simple way to deal with categorical variables is one-hot encoding, where a dummy for each category is created. Note that the gender column was excluded from the encoding process, as it already contained only binary values, i.e., 1 for ‘male’ and 0 for ‘female’.
gender = df.pop('gender')
cat_var = ['region', 'sector']
df_cat_transformed = pd.get_dummies(df[cat_var])
This process results in the creation of N new columns, where N is the number of categories, as the small example below illustrates.
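Here is a toy column with made-up values (not from the actual dataset) to show what get_dummies produces:

# toy example of one-hot encoding with pandas
toy = pd.DataFrame({'region': ['Europe', 'North America', 'East Asia']})
print(pd.get_dummies(toy))
# each of the three categories becomes its own 0/1 column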
After that, all the transformed features (X-variables) are combined in one dataset. This dataset now has 18 columns.
Additionally, note that the ‘founder’ variable (the Y variable) is already a dummy variable, so it needs no further encoding. It is stored separately.
label = df.pop('founder')
df_transformed = pd.concat([gender, df_cat_transformed, df_num_transformed], axis=1)
For the further implementation of the machine learning algorithm, the data were converted to NumPy arrays.
X = df_transformed.values
Y = label.values
Feature selection. In this step, I ran a selection algorithm to select the 10 features with the most explanatory power over the dependent variable Y. The selection procedure is based on the chi-squared statistical test. In this step, the scikit-learn (sklearn) module is introduced to the programming process.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Define the test function
test = SelectKBest(score_func=chi2, k=10)

# Fit the X and Y values on the test function
fit = test.fit(X, Y)
Here, I initialize the selection algorithm with the chi-squared test as the scoring function and the number of features that I want to select. I then fit it on the X and Y variables to obtain the test score for each feature. The ten features with the highest scores are selected in this process.
It is important to consider the compromise regarding the number of features: too few variables may leave out useful information, while too many can lead to an overfitted model. I would like to discuss this topic in more detail in my next posts, but for now, I chose 10 features.
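As a sketch of how this compromise could be explored (this is an addition and not part of the original workflow), one could compare a few candidate values of k with cross-validation:

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# compare a few candidate numbers of features with 5-fold cross-validation
for k in (5, 10, 15):
    pipe = make_pipeline(SelectKBest(score_func=chi2, k=k), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, Y, cv=5)
    print(k, round(scores.mean(), 3))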
For convenience, I can also print the names of the selected features. As can be seen, the gender dummy, region dummies and sector dummies were selected from the categorical variables. From the numerical variables, only the ‘company age’ variable had a high enough score.
# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

# Check which columns are selected by retrieving the indexes
cols = test.get_support(indices=True)
print(cols)

# Extract the names of the columns given the indexes
colname = df_transformed.columns[cols]
print(colname)
I finalize the feature selection process by storing the selected features separately in an array data type. This new selection is now going to be used as the X variable.
features = fit.transform(X)
Finally, we split the features and the Y data into train and test sets using scikit-learn's built-in train_test_split function. Here, the test size is specified to be 25% of the complete dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, Y, test_size=0.25, random_state=0)
Logistic regression model setup. The LogisticRegression model can be loaded from the scikit-learn module. After initializing the model, I used the X and Y training sets to fit it on the given dataset. Next, I predicted the Y labels on the X test set with the trained model.
from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression classifier
logreg = LogisticRegression()

# Train the model using the training sets
logreg.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred_LR = logreg.predict(X_test)
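Since logistic regression outputs probabilities, it can also be instructive to look at them directly; this extra step is my addition and is not shown in the original post:

# predicted probability of being a founder for the first few test observations
print(logreg.predict_proba(X_test)[:5, 1])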
Result visualization. In the final stage of this tutorial, I would like to show how you can visualize the output of the model using a confusion matrix. Here, I compare the true labels stored in y_test with the predicted labels stored in y_pred_LR. In the confusion matrix, the correctly predicted values are displayed on the diagonal, while the incorrectly predicted values appear off the diagonal.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics

# Logistic Regression confusion matrix
cnf_matrix_LR = metrics.confusion_matrix(y_test, y_pred_LR)

# Create heatmap
ax = sns.heatmap(pd.DataFrame(cnf_matrix_LR), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix Logistic Regression', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
In this case, it can be seen that the model performed quite well: a total of 166 non-founder billionaires and 164 founder billionaires were predicted correctly.
From these numbers, other performance metrics can be derived. For instance, model accuracy, which is the ratio of correctly predicted observations, is computed in the following way: (164 + 166) / (164 + 166 + 33 + 56) ≈ 79%. This number can also be obtained from the metrics module.
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_LR))
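Beyond accuracy, the per-class precision and recall can be inspected with the classification report; this is an extra step not shown in the original post:

# precision, recall and F1-score for both classes
print(metrics.classification_report(y_test, y_pred_LR))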
Conclusion. In this tutorial, I implemented a logistic regression classification model on billionaires’ characteristics. I aimed to verify whether the model could predict the likelihood that a billionaire obtained their wealth from founding a company (rather than inheriting or acquiring it). With the selected features, the model was 79% accurate.
I would say that, for now, it was enjoyable to try out the classification algorithm on this dataset. However, for further research, I would modify the selection of explanatory variables to include more personal characteristics, such as the highest education level, family income and other income sources.
Thank you for reading this article, I hope it was enjoyable and you found value in it! Please let me know if you have any remarks on my explanations, and I hope to see you in my next blog posts!
References
Gladwell, Malcolm. Outliers: The Story of Success. New York: Little, Brown and Company, 2008.