In this notebook I use scikit-learn's RandomForestClassifier to predict whether a person will buy life insurance.¶

I import data from a database using SQL, encode the data, remove outliers, oversample the minority class, train a model (using GridSearchCV to fine-tune the hyperparameters), evaluate the model, remove the least important feature, train a second model (with one fewer feature), evaluate the second model, and make predictions on unseen data.¶

In [1]:
rand_state = 7

Import packages and data¶

In [2]:
# Scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Other modules/settings
from imblearn.over_sampling import RandomOverSampler
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline
In [3]:
# Import data for England (filtered in the SQL WHERE clause)
conn = sqlite3.connect('life_insurance.db')

df = pd.read_sql_query("""
    SELECT
        age, salary, number_of_kids, has_degree, is_married, sex, bought_insurance
    
    FROM
        life_policies
    
    WHERE
        country = 'england'
    """, conn)

conn.close()

df.head(4)
Out[3]:
age salary number_of_kids has_degree is_married sex bought_insurance
0 35 95469 0 Yes Yes Female No
1 42 23859 2 Yes No Female No
2 36 45412 2 Yes Yes Male No
3 46 40202 3 Yes No Male No

Print overview of dataframe¶

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2599 entries, 0 to 2598
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   age               2599 non-null   int64 
 1   salary            2599 non-null   int64 
 2   number_of_kids    2599 non-null   int64 
 3   has_degree        2599 non-null   object
 4   is_married        2599 non-null   object
 5   sex               2599 non-null   object
 6   bought_insurance  2599 non-null   object
dtypes: int64(3), object(4)
memory usage: 142.3+ KB

Encode data¶

In [5]:
def transform_categoricals(df):
    column_mappings = {
        'sex': {'Female': 0, 'Male': 1},
        'is_married': {'No': 0, 'Yes': 1},
        'has_degree': {'No': 0, 'Yes': 1}
    }
    
    return df.replace(column_mappings)

df = transform_categoricals(df)
In [6]:
df.head(3)
Out[6]:
age salary number_of_kids has_degree is_married sex bought_insurance
0 35 95469 0 1 1 0 No
1 42 23859 2 1 0 0 No
2 36 45412 2 1 1 1 No

Remove outliers¶

In [7]:
remove_outliers_from = ['age'
                        ,'salary'
                        ,'number_of_kids'
                       ]

# Remove rows with values more than 3 standard deviations above or below the mean
for col in remove_outliers_from:
    mean = df[col].mean()
    std_dev = df[col].std()
    lower_cutoff = mean - 3 * std_dev
    upper_cutoff = mean + 3 * std_dev

    pre_removal = df.shape[0]
    
    df = df[(df[col] > lower_cutoff) & (df[col] < upper_cutoff)]

    post_removal = df.shape[0]

    if post_removal < pre_removal:
        print(str(pre_removal-post_removal) + " outlier value(s) identified in '" + col + "' removed.")
1 outlier value(s) identified in 'salary' removed.

Define target vector and feature matrix¶

In [8]:
target_variable = 'bought_insurance'
In [9]:
y = df[target_variable]
X = df.drop([target_variable], axis=1)

Perform oversampling of the minority class¶

In [10]:
insurance_counts = y.value_counts()

insurance_counts.plot(kind='bar', rot=0, color=['green', 'red'])
plt.xlabel('Purchased life insurance?')
plt.ylabel('Count')
plt.title('Proportion of 30-65 year olds who purchased life insurance')

plt.show()
[Bar chart: counts of 'No' and 'Yes' for bought_insurance]
In [11]:
over_sampler = RandomOverSampler(sampling_strategy=1, random_state=rand_state)
X, y = over_sampler.fit_resample(X, y)
y.value_counts()
Out[11]:
bought_insurance
No     1694
Yes    1694
Name: count, dtype: int64

Split data into test and train datasets¶

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=rand_state)

X_test.head()
Out[12]:
age salary number_of_kids has_degree is_married sex
2024 60 34039 2 0 1 0
3182 51 46302 4 0 1 0
1969 48 47502 1 1 1 1
1559 44 33797 5 0 1 0
87 33 82144 2 1 1 1

Create Random Forest and make predictions¶

In [13]:
# Define parameters
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [7, 10, 20],
    'min_samples_split': [2, 5, 10]
}
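The grid above defines 3 × 3 × 3 = 27 parameter combinations, so with 5-fold cross-validation GridSearchCV will fit 135 forests before refitting the best one on the full training set. A minimal sketch to sanity-check that count (not part of the original run):

from itertools import product

# Count parameter combinations, then multiply by the number of CV folds
n_combinations = len(list(product(*param_grid.values())))
print(n_combinations, 'combinations ->', n_combinations * 5, 'fits with 5-fold CV')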
In [14]:
# Initialise RandomForestClassifier
rf_clf = RandomForestClassifier(random_state=rand_state)

# Initialise and fit grid search
grid_search = GridSearchCV(estimator=rf_clf
                           ,param_grid=param_grid
                           ,cv=5 # 5-fold cross-validation
                          )

grid_search.fit(X_train, y_train)

# Get the best parameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Make predictions with the best estimator
y_prediction = best_estimator.predict(X_test)

Evaluate model¶

In [15]:
def evaluate_model(y_test, y_prediction, best_params):
    score = round(accuracy_score(y_test, y_prediction),3)
    print('Model accuracy score is ' + str(score) + "\n")  

    print("Full classification report:\n" + classification_report(y_test, y_prediction))
        
    print("Best model is...\nMax depth: "
           + str(best_params['max_depth'])
           + ", Min samples split: "
           + str(best_params['min_samples_split'])
           + ", Number of estimators: "
           + str(best_params['n_estimators'])
          )

    return score

score1 = evaluate_model(y_test, y_prediction, best_params)
Model accuracy score is 0.895

Full classification report:
              precision    recall  f1-score   support

          No       0.93      0.86      0.89       575
         Yes       0.87      0.93      0.90       544

    accuracy                           0.89      1119
   macro avg       0.90      0.90      0.89      1119
weighted avg       0.90      0.89      0.89      1119

Best model is...
Max depth: 20, Min samples split: 2, Number of estimators: 500
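The classification report above already breaks the errors down by class; as an optional extra view of the same predictions, a confusion matrix can be printed. A minimal sketch using scikit-learn's confusion_matrix (an import not used elsewhere in this notebook):

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
conf_mat = confusion_matrix(y_test, y_prediction, labels=['No', 'Yes'])
print(pd.DataFrame(conf_mat,
                   index=['Actual No', 'Actual Yes'],
                   columns=['Predicted No', 'Predicted Yes']))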

Identify most important features¶

In [16]:
feature_scores = pd.Series(best_estimator.feature_importances_, index=X_train.columns).sort_values(ascending=False)

sns.barplot(x=feature_scores, y=feature_scores.index)
plt.xlabel('Feature importance score')
plt.ylabel('Feature')
plt.show()
[Bar chart: feature importance scores by feature]

Retrain/retest the model without the least important feature and save the better model¶

In [17]:
worst_feature = feature_scores.sort_values(ascending=True).index[0]
In [18]:
X = X.drop([worst_feature], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = rand_state)
In [19]:
# Initialise a new RandomForestClassifier
rf_clf_2 = RandomForestClassifier(random_state=rand_state)

# Initialise and fit grid search
grid_search = GridSearchCV(estimator=rf_clf_2
                           ,param_grid=param_grid
                           ,cv=5 # 5-fold cross-validation
                          )

grid_search.fit(X_train, y_train)

# Get the best parameters and estimator
best_params_2 = grid_search.best_params_
best_estimator_2 = grid_search.best_estimator_

# Make predictions with the best estimator
y_prediction = best_estimator_2.predict(X_test)
In [20]:
score2 = evaluate_model(y_test, y_prediction, best_params_2)
Model accuracy score is 0.901

Full classification report:
              precision    recall  f1-score   support

          No       0.94      0.86      0.90       575
         Yes       0.87      0.94      0.90       544

    accuracy                           0.90      1119
   macro avg       0.90      0.90      0.90      1119
weighted avg       0.90      0.90      0.90      1119

Best model is...
Max depth: 20, Min samples split: 2, Number of estimators: 500
In [21]:
if score1 < score2:
    higher_lower = "higher"
    best_estimator = best_estimator_2
    remove_worst_predictor = True
    dump(best_estimator_2, 'rf_model.joblib')  # Persist the retrained, tuned model
else:
    higher_lower = "lower"
    remove_worst_predictor = False
    dump(best_estimator, 'rf_model.joblib')  # Persist the original tuned model

print('Retrained model accuracy score is ' + str(score2) + " (" + str(round(abs(score2 - score1), 3)) + " " + higher_lower + ")")
Retrained model accuracy score is 0.901 (0.006 higher)
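The winning tuned estimator is persisted with joblib above. When it is needed again, for example in a separate scoring script, it can be reloaded. A minimal sketch, assuming the rf_model.joblib file written in the previous cell:

from joblib import load

# Reload the persisted random forest; it can then be used exactly like best_estimator
loaded_model = load('rf_model.joblib')
print(type(loaded_model))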

Use model to predict on new data¶

In [22]:
df_to_predict_on = pd.read_excel("data_to_predict_on.xlsx", sheet_name="candidates")

df_to_predict_on.head(3)
Out[22]:
age salary number_of_kids has_degree is_married sex
0 35 50100 0 Yes No Male
1 39 55000 1 Yes No Female
2 22 24001 0 No No Male
In [23]:
df_to_predict_on = transform_categoricals(df_to_predict_on)

if remove_worst_predictor:
    df_to_predict_on = df_to_predict_on.drop([worst_feature], axis=1)

df_to_predict_on.head(3)
Out[23]:
age salary number_of_kids has_degree is_married
0 35 50100 0 1 0
1 39 55000 1 1 0
2 22 24001 0 0 0
In [24]:
predictions = best_estimator.predict(df_to_predict_on)

df_to_predict_on['predicted_outcome'] = predictions

df_to_predict_on
Out[24]:
age salary number_of_kids has_degree is_married predicted_outcome
0 35 50100 0 1 0 No
1 39 55000 1 1 0 No
2 22 24001 0 0 0 No
3 62 45045 2 1 0 Yes
4 49 47500 3 1 1 Yes
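If the scored candidates need to be shared, the dataframe (including the predicted_outcome column) can be written back out. A minimal sketch, assuming a hypothetical output file name predicted_candidates.xlsx:

# Write the candidates and their predictions to a new spreadsheet (file name is illustrative)
df_to_predict_on.to_excel('predicted_candidates.xlsx', sheet_name='candidates', index=False)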