Welcome to the project on classification using Decision Tree and Random Forest.
The EdTech industry has surged immensely in the past decade. According to one forecast, the online education market will be worth $286.62 billion by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has driven much of this growth and expansion. With dominant features like ease of information sharing, personalized learning experiences, and transparency of assessment, it is now often preferred over traditional education.
The online education sector has witnessed rapid growth and is attracting a lot of new customers. Due to this rapid growth, many new companies have emerged in this industry. With the availability and ease of use of digital marketing resources, companies can reach out to a wider audience with their offerings. The customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:
- Advertisements in print media (newspapers, magazines)
- Advertisements on digital platforms
- Educational channels such as forums, discussion threads, and educational websites
- Referrals from existing customers
The company then nurtures these leads and tries to convert them to paid customers. For this, a representative from the organization connects with the lead over a call or through email to share further details.
ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is identifying which of the leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to analyze and build an ML model that identifies which leads are more likely to convert, and to find the factors driving the conversion process.
The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.
- ID: Unique identifier of the lead
- age: Age of the lead
- current_occupation: Current occupation of the lead (Professional, Unemployed, or Student)
- first_interaction: How the lead first interacted with ExtraaLearn (Website or Mobile App)
- profile_completed: Level of profile completion on the website/mobile app (Low, Medium, or High)
- website_visits: Number of times the lead visited the website
- time_spent_on_website: Total time spent on the website
- page_views_per_visit: Average number of pages viewed per website visit
- last_activity: Last interaction between the lead and ExtraaLearn (Email, Phone, or Website Activity)
- print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper
- print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in a Magazine
- digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on digital platforms
- educational_channels: Flag indicating whether the lead had heard about ExtraaLearn through educational channels (online forums, discussion threads, educational websites, etc.)
- referral: Flag indicating whether the lead was referred by an existing customer
- status: Whether the lead was converted to a paid customer (1) or not (0); this is the target variable
import warnings
warnings.filterwarnings("ignore")
# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
# Algorithms to use
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn import metrics
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
learn = pd.read_csv("ExtraaLearn.csv")
# Copying data to another variable to avoid any changes to the original data
data = learn.copy()
data.head()
 | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | EXT001 | 57 | Unemployed | Website | High | 7 | 1639 | 1.861 | Website Activity | Yes | No | Yes | No | No | 1 |
1 | EXT002 | 56 | Professional | Mobile App | Medium | 2 | 83 | 0.320 | Website Activity | No | No | No | Yes | No | 0 |
2 | EXT003 | 52 | Professional | Website | Medium | 3 | 330 | 0.074 | Website Activity | No | No | Yes | No | No | 0 |
3 | EXT004 | 53 | Unemployed | Website | High | 4 | 464 | 2.057 | Website Activity | No | No | No | No | No | 1 |
4 | EXT005 | 23 | Student | Website | High | 4 | 600 | 16.914 | Email Activity | No | No | No | No | No | 0 |
data.tail()
 | ID | age | current_occupation | first_interaction | profile_completed | website_visits | time_spent_on_website | page_views_per_visit | last_activity | print_media_type1 | print_media_type2 | digital_media | educational_channels | referral | status
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4607 | EXT4608 | 35 | Unemployed | Mobile App | Medium | 15 | 360 | 2.170 | Phone Activity | No | No | No | Yes | No | 0 |
4608 | EXT4609 | 55 | Professional | Mobile App | Medium | 8 | 2327 | 5.393 | Email Activity | No | No | No | No | No | 0 |
4609 | EXT4610 | 58 | Professional | Website | High | 2 | 212 | 2.692 | Email Activity | No | No | No | No | No | 1 |
4610 | EXT4611 | 57 | Professional | Mobile App | Medium | 1 | 154 | 3.879 | Website Activity | Yes | No | No | No | No | 0 |
4611 | EXT4612 | 55 | Professional | Website | Medium | 4 | 2290 | 2.075 | Phone Activity | No | No | No | No | No | 0 |
data.shape
(4612, 15)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   ID                     4612 non-null   object
 1   age                    4612 non-null   int64
 2   current_occupation     4612 non-null   object
 3   first_interaction      4612 non-null   object
 4   profile_completed      4612 non-null   object
 5   website_visits         4612 non-null   int64
 6   time_spent_on_website  4612 non-null   int64
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object
 9   print_media_type1      4612 non-null   object
 10  print_media_type2      4612 non-null   object
 11  digital_media          4612 non-null   object
 12  educational_channels   4612 non-null   object
 13  referral               4612 non-null   object
 14  status                 4612 non-null   int64
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
Observations:
- age, website_visits, time_spent_on_website, page_views_per_visit, and status are of numeric type, while the rest of the columns are of object type.
- There are no null values in the dataset.
# Checking for duplicate values
data.duplicated().sum()
0
# Summary statistics for the numerical columns
numeric_cols = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
data[numeric_cols].describe().T  # describe() on this subset shows only the numerical columns
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
age | 4612.0 | 46.201214 | 13.161454 | 18.0 | 36.00000 | 51.000 | 57.00000 | 63.000 |
website_visits | 4612.0 | 3.566782 | 2.829134 | 0.0 | 2.00000 | 3.000 | 5.00000 | 30.000 |
time_spent_on_website | 4612.0 | 724.011275 | 743.828683 | 0.0 | 148.75000 | 376.000 | 1336.75000 | 2537.000 |
page_views_per_visit | 4612.0 | 3.026126 | 1.968125 | 0.0 | 2.07775 | 2.792 | 3.75625 | 18.434 |
Observations:
- The average lead age is about 46 years, with a range of 18 to 63 and a median of 51, so the audience skews older.
- website_visits is right-skewed: the median is 3 visits, but the maximum is 30, which points to outliers.
- time_spent_on_website varies widely, from 0 to 2537, with a median of 376.
- page_views_per_visit has a median of about 2.8 but a maximum of about 18.4, again pointing to outliers.
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)
# Printing count of each unique value in each categorical column
for column in cat_col:
    print(data[column].value_counts(normalize = True))
    print("-" * 50)
EXT001     0.000217
EXT2884    0.000217
EXT3080    0.000217
EXT3079    0.000217
EXT3078    0.000217
             ...
EXT1537    0.000217
EXT1536    0.000217
EXT1535    0.000217
EXT1534    0.000217
EXT4612    0.000217
Name: ID, Length: 4612, dtype: float64
--------------------------------------------------
Professional    0.567216
Unemployed      0.312446
Student         0.120338
Name: current_occupation, dtype: float64
--------------------------------------------------
Website       0.551171
Mobile App    0.448829
Name: first_interaction, dtype: float64
--------------------------------------------------
High      0.490893
Medium    0.485906
Low       0.023200
Name: profile_completed, dtype: float64
--------------------------------------------------
Email Activity      0.493929
Phone Activity      0.267563
Website Activity    0.238508
Name: last_activity, dtype: float64
--------------------------------------------------
No     0.892238
Yes    0.107762
Name: print_media_type1, dtype: float64
--------------------------------------------------
No     0.94948
Yes    0.05052
Name: print_media_type2, dtype: float64
--------------------------------------------------
No     0.885733
Yes    0.114267
Name: digital_media, dtype: float64
--------------------------------------------------
No     0.847138
Yes    0.152862
Name: educational_channels, dtype: float64
--------------------------------------------------
No     0.979835
Yes    0.020165
Name: referral, dtype: float64
--------------------------------------------------
Observations:
- Most leads are working professionals (about 57%), followed by unemployed leads (about 31%) and students (about 12%).
- The website is the most common channel of first interaction (about 55%), with the mobile app accounting for the rest.
- Profile completion is almost evenly split between High and Medium; very few leads (about 2%) have a Low completion level.
- Email activity is the most common last activity (about 49%).
- Each marketing flag is "No" for the large majority of leads; referrals in particular account for only about 2% of leads.
- Every value of ID is unique (4612 distinct values), so the column carries no predictive information.
# Checking the number of unique values
data["ID"].nunique()
4612
# Dropping ID column
data.drop(["ID"], axis = 1, inplace = True)
plt.figure(figsize = (10, 6))
ax = sns.countplot(x = 'status', data = data)
# Annotating the exact count on the top of the bar for each category
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x(), p.get_height() + 0.35))
plt.show()
Let's check the distribution and outliers for numerical columns in the data
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize = (15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins = 10, grid = False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x = data[col])
    plt.show()
age
Skew : -0.72
website_visits
Skew : 2.16
time_spent_on_website
Skew : 0.95
page_views_per_visit
Skew : 1.27
Observations:
- age is moderately left-skewed (-0.72), consistent with the older lead base seen earlier.
- website_visits is heavily right-skewed (2.16), and its boxplot shows many outliers on the upper end.
- time_spent_on_website (0.95) and page_views_per_visit (1.27) are also right-skewed, with upper-end outliers for page views per visit.
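To quantify what the boxplots show, here is an optional sketch (not part of the original flow) that counts outliers per column using the standard 1.5 x IQR rule, the same rule the boxplot whiskers visualize:
# Optional: count outliers per column using the 1.5 * IQR rule
for col in numeric_cols:
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    print(f"{col}: {((data[col] < lower) | (data[col] > upper)).sum()} outliers")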
We are done with univariate analysis and data preprocessing. Let's explore the data a bit more with bivariate analysis.
Leads will have different expectations from the outcome of the course, and their current occupation may play a key role in their decision to take the program. Let's analyze it.
plt.figure(figsize = (10, 6))
sns.countplot(x = 'current_occupation', hue = 'status', data = data)
plt.show()
Observations:
Age can also be a good factor to differentiate between such leads. Let's explore this.
plt.figure(figsize = (10, 5))
sns.boxplot(data["current_occupation"], data["age"])
plt.show()
data.groupby(["current_occupation"])["age"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
current_occupation | ||||||||
Professional | 2616.0 | 49.347477 | 9.890744 | 25.0 | 42.0 | 54.0 | 57.0 | 60.0 |
Student | 555.0 | 21.144144 | 2.001114 | 18.0 | 19.0 | 21.0 | 23.0 | 25.0 |
Unemployed | 1441.0 | 50.140180 | 9.999503 | 32.0 | 42.0 | 54.0 | 58.0 | 63.0 |
Observations:
- Students are the youngest group, with a mean age of about 21 years (range 18 to 25).
- Professionals (mean about 49) and unemployed leads (mean about 50) are considerably older, with unemployed leads ranging up to 63.
The company's first interaction with leads should be compelling and persuasive. Let's see if the channels of the first interaction have an impact on the conversion of leads.
plt.figure(figsize = (10, 6))
sns.countplot(x = 'first_interaction', hue = 'status', data = data)
plt.show()
Observations:
We observed earlier that some leads spend much more time on the website than others. Let's analyze whether spending more time on the website results in conversion.
plt.figure(figsize = (10, 5))
sns.boxplot(data["status"], data["time_spent_on_website"])
plt.show()
Observations:
People browsing the website or the mobile app are generally required to create a profile by sharing their details before they can access more information. Let's see if the profile completion level has an impact on lead conversion.
plt.figure(figsize = (10, 6))
sns.countplot(x = 'profile_completed', hue = 'status', data = data)
plt.show()
Observations:
Referrals from converted leads can be a good source of new leads at a very low cost of advertisement. Let's see how referrals impact lead conversion status.
plt.figure(figsize = (10, 6))
sns.countplot(x = 'referral', hue = 'status', data = data)
plt.show()
Observations:
plt.figure(figsize = (12, 7))
# numeric_only=True avoids errors from the object-type columns in recent pandas versions
sns.heatmap(data.corr(numeric_only = True), annot = True, fmt = '.2f', cmap = "YlGnBu")
plt.show()
Observations:
# Separating the target variable and other variables
X = data.drop(columns = 'status')
Y = data['status']
# Creating dummy variables; drop_first=True avoids redundant (perfectly collinear) columns (see the short illustration after this cell)
X = pd.get_dummies(X, drop_first = True)
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)
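As a short illustration of what drop_first=True does (toy data, not part of the project flow): a three-level column yields only two dummies, and the dropped level is implied when both are 0.
# Toy illustration: 'High' (alphabetically first) is dropped and implied by (0, 0)
demo = pd.DataFrame({"profile_completed": ["High", "Medium", "Low"]})
print(pd.get_dummies(demo, drop_first = True))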
Checking the shape of the train and test data
print("Shape of the training set: ", X_train.shape)
print("Shape of the test set: ", X_test.shape)
print("Percentage of classes in the training set:")
print(y_train.value_counts(normalize = True))
print("Percentage of classes in the test set:")
print(y_test.value_counts(normalize = True))
Shape of the training set:  (3228, 16)
Shape of the test set:  (1384, 16)
Percentage of classes in the training set:
0    0.704151
1    0.295849
Name: status, dtype: float64
Percentage of classes in the test set:
0    0.695087
1    0.304913
Name: status, dtype: float64
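The class proportions above happen to be similar across the two sets. If we wanted to guarantee identical proportions, train_test_split supports a stratify argument; a minimal sketch, not used for the results in this project (the _s-suffixed names are illustrative):
# Optional: a stratified split keeps the class mix identical in train and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, test_size = 0.30, random_state = 1, stratify = Y
)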
Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.
The model can make two kinds of wrong predictions:
- Predicting that a lead will not convert when the lead would have converted: the company loses a potential customer (a false negative).
- Predicting that a lead will convert when the lead does not convert: the company wastes resources nurturing a false positive.
Losing a potential customer is the greater loss for the organization, so Recall is the metric to be maximized: the greater the recall score, the lower the number of false negatives. Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
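As a quick illustration of why recall is the right lens here (toy labels, purely illustrative): recall = TP / (TP + FN), so every converter the model misses lowers it directly.
# Toy example: three leads actually convert, but the model catches only one
# (recall_score was already imported above)
y_true_demo = [1, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0]  # two false negatives = two lost customers
print(recall_score(y_true_demo, y_pred_demo))  # -> 0.33...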
# Function to print the classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize = (8, 5))
    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['Not Converted', 'Converted'], yticklabels = ['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state = 7)
d_tree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=7)
Let's check the performance on the training data
# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)
metrics_score(y_train, y_pred_train1)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228
Observations:
- The model scores 1.00 on every metric on the training data; a decision tree grown without depth limits memorizes the training set, so this almost certainly indicates overfitting.
Let's check the performance on test data to see if the model is overfitting.
# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)
metrics_score(y_test, y_pred_test1)
              precision    recall  f1-score   support

           0       0.87      0.86      0.87       962
           1       0.69      0.70      0.70       422

    accuracy                           0.81      1384
   macro avg       0.78      0.78      0.78      1384
weighted avg       0.81      0.81      0.81      1384
Observations:
- Performance drops sharply on the test data, confirming that the default decision tree is overfitting.
- Recall for class 1 is only 0.70, i.e., the model misses about 30% of the leads that actually convert.
Let's try hyperparameter tuning using GridSearchCV to find the optimal max_depth and reduce the overfitting of the model. We can tune some other hyperparameters as well.
We will set the class_weight hyperparameter to {0: 0.3, 1: 0.7}, which is approximately the inverse of the class distribution in the training data (about 70% class 0 and 30% class 1).
This tells the model that class 1 is the important class here.
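As an illustrative sanity check, the chosen weights can be derived by inverting the training class mix:
# Inverting the training class distribution (~0.70 / 0.30) gives the weights
freq = y_train.value_counts(normalize = True)
print({0: round(freq[1], 1), 1: round(freq[0], 1)})  # -> {0: 0.3, 1: 0.7}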
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = {0: 0.3, 1: 0.7})
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy', max_depth=3, min_samples_leaf=5, random_state=7)
We have tuned the model and fit the tuned model on the training data. Now, let's check the model performance on the training and testing data.
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)
metrics_score(y_train, y_pred_train2)
              precision    recall  f1-score   support

           0       0.94      0.77      0.85      2273
           1       0.62      0.88      0.73       955

    accuracy                           0.80      3228
   macro avg       0.78      0.83      0.79      3228
weighted avg       0.84      0.80      0.81      3228
Observations:
- The tuned tree no longer fits the training data perfectly, and recall for class 1 has risen to 0.88 at the cost of precision (0.62), as expected given the class weights we supplied.
Let's check the model performance on the testing data
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)
metrics_score(y_test, y_pred_test2)
              precision    recall  f1-score   support

           0       0.93      0.77      0.84       962
           1       0.62      0.86      0.72       422

    accuracy                           0.80      1384
   macro avg       0.77      0.82      0.78      1384
weighted avg       0.83      0.80      0.80      1384
Observations:
- Training and test performance are now very close, so the overfitting has been addressed.
- Test recall for class 1 improved from 0.70 to 0.86, although precision for class 1 dropped to 0.62.
Let's visualize the tuned decision tree and observe the decision rules:
features = list(X.columns)
plt.figure(figsize = (20, 20))
tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)
plt.show()
Note: Blue leaves represent the converted leads, i.e., y[1], while the orange leaves represent the not-converted leads, i.e., y[0]. Also, the more observations a leaf contains, the darker its color.
Observations:
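If the rendered tree is hard to read, the same decision rules can be printed as plain text with sklearn's export_text; a minimal sketch:
# Text view of the tuned tree's decision rules (alternative to the plot above)
from sklearn.tree import export_text
print(export_text(d_tree_tuned, feature_names = features, show_weights = True))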
Let's look at the feature importance of the tuned decision tree model
# Importance of features in the tree building
print(pd.DataFrame(d_tree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                     Imp
time_spent_on_website           0.348142
first_interaction_Website       0.327181
profile_completed_Medium        0.239274
age                             0.063893
last_activity_Website Activity  0.021511
website_visits                  0.000000
page_views_per_visit            0.000000
current_occupation_Student      0.000000
current_occupation_Unemployed   0.000000
profile_completed_Low           0.000000
last_activity_Phone Activity    0.000000
print_media_type1_Yes           0.000000
print_media_type2_Yes           0.000000
digital_media_Yes               0.000000
educational_channels_Yes        0.000000
referral_Yes                    0.000000
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize = (10, 10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
- time_spent_on_website is the most important feature, followed closely by first_interaction_Website and profile_completed_Medium; age and last_activity_Website Activity contribute marginally.
- All remaining features have zero importance in this tuned tree.
Now, let's build another model - a random forest classifier.
# Fitting the random forest classifier on the training data
rf_estimator = RandomForestClassifier(criterion = 'entropy', random_state = 7)
rf_estimator.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', random_state=7)
Let's check the performance of the model on the training data
# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train3)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228
Observations:
- Like the default decision tree, the untuned random forest fits the training data perfectly, which indicates overfitting.
Let's check the performance on the testing data
# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test3)
              precision    recall  f1-score   support

           0       0.88      0.93      0.90       962
           1       0.81      0.70      0.75       422

    accuracy                           0.86      1384
   macro avg       0.84      0.81      0.83      1384
weighted avg       0.86      0.86      0.86      1384
Observations:
- The random forest generalizes better than the default decision tree (test accuracy 0.86 vs 0.81), but recall for class 1 is still only 0.70, so it misses the same share of converting leads.
Let's see if we can get a better model by tuning some of the important hyperparameters of the random forest classifier. We will not tune the criterion hyperparameter, as we know from the decision tree tuning that entropy is the better splitting criterion for this data.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [100, 110, 120],
"max_depth": [5, 6, 7],
"max_features": [0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', max_depth=6, max_features=0.8, n_estimators=110, random_state=7)
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train4)
              precision    recall  f1-score   support

           0       0.91      0.92      0.91      2273
           1       0.80      0.78      0.79       955

    accuracy                           0.88      3228
   macro avg       0.86      0.85      0.85      3228
weighted avg       0.88      0.88      0.88      3228
Observations:
- Overfitting has been reduced, but training recall for class 1 (0.78) is below what the tuned decision tree achieved, so let's extend the search with class weights.
Note: GridSearchCV can take a long time to run depending on the number of hyperparameters and the number of values tried for each hyperparameter. Therefore, we have reduced the number of values passed to each hyperparameter.
Note: The below code might take some time to run depending on your system's configuration.
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)
# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
"max_depth": [6, 7],
"min_samples_leaf": [20, 25],
"max_features": [0.8, 0.9],
"max_samples": [0.9, 1],
"class_weight": [{0: 0.7, 1: 0.3}, "balanced", {0: 0.4, 1: 0.1}]
}
# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)
# Run the grid search on the training data
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Save the best estimator to variable rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_
# Fit the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', criterion='entropy', max_depth=6, max_features=0.8, max_samples=0.9, min_samples_leaf=25, n_estimators=120, random_state=7)
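Note: if the exhaustive search above is too slow on your machine, RandomizedSearchCV samples a fixed number of parameter combinations instead of trying all of them. A minimal sketch, not used for the results reported here:
# Randomized alternative to the exhaustive grid search above
from sklearn.model_selection import RandomizedSearchCV
random_obj = RandomizedSearchCV(RandomForestClassifier(criterion = "entropy", random_state = 7),
                                param_distributions = parameters,  # same grid as above
                                n_iter = 20, scoring = scorer, cv = 5, random_state = 7)
random_obj.fit(X_train, y_train)
print(random_obj.best_params_)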
Let's check the performance of the tuned model
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)
metrics_score(y_train, y_pred_train5)
              precision    recall  f1-score   support

           0       0.94      0.83      0.88      2273
           1       0.68      0.87      0.76       955

    accuracy                           0.84      3228
   macro avg       0.81      0.85      0.82      3228
weighted avg       0.86      0.84      0.84      3228
Observations:
- With class weights in the grid, training recall for class 1 rises to 0.87 while the model still avoids memorizing the training data.
Let's check the model performance on the test data
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)
metrics_score(y_test, y_pred_test5)
              precision    recall  f1-score   support

           0       0.93      0.83      0.87       962
           1       0.68      0.85      0.76       422

    accuracy                           0.83      1384
   macro avg       0.80      0.84      0.82      1384
weighted avg       0.85      0.83      0.84      1384
Observations:
- Training and test scores are close, so the tuned random forest generalizes well.
- Test recall for class 1 is 0.85, comparable to the tuned decision tree, with similar precision (0.68) and slightly higher accuracy (0.83 vs 0.80).
One of the drawbacks of ensemble models is that we lose interpretability: we cannot observe the decision rules of a random forest the way we did for a single decision tree. So, let's check the feature importances of the model instead.
importances = rf_estimator_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
plt.figure(figsize = (12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Observations:
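For a model-agnostic view (an optional aside, not in the original flow), sklearn's permutation_importance measures how much a chosen score drops when each column is shuffled:
# Permutation importance on the test set, scored by recall to match our metric
from sklearn.inspection import permutation_importance
perm = permutation_importance(rf_estimator_tuned, X_test, y_test,
                              scoring = "recall", n_repeats = 10, random_state = 7)
print(pd.Series(perm.importances_mean, index = X_test.columns).sort_values(ascending = False).head(10))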
Write your conclusions on the key factors that drive the conversion of leads, and your recommendations to the business on how they can improve the conversion rate. (10 Marks)