Decision Trees and Random Forest Project:

Predicting Potential Customers

Welcome to the project on classification using Decision Trees and Random Forests.


Context


The EdTech industry has surged immensely over the past decade; according to one forecast, the online education market will be worth $286.62 billion by 2023, with a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. The modern era of online education has contributed greatly to this growth and expansion. With advantages such as ease of information sharing, personalized learning experiences, and transparency of assessment, it is now often preferred over traditional education.

The online education sector has witnessed rapid growth and is attracting many new customers. Due to this rapid growth, many new companies have emerged in the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. The customers who show interest in these offerings are termed leads, and EdTech companies obtain them from a variety of sources.

The company then nurtures these leads and tries to convert them into paid customers. To do this, a representative from the organization connects with the lead by phone or email to share further details.


Objective


ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues faced by ExtraaLearn is identifying which leads are more likely to convert so that it can allocate its resources accordingly. You, as a data scientist at ExtraaLearn, have been provided with the leads data to analyze it, identify the factors that drive lead conversion, and build a model that predicts which leads are most likely to convert.


Data Description


The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

Importing the necessary libraries and overview of the dataset

Loading the dataset

View the first and the last 5 rows of the dataset

Understand the shape of the dataset

Check the data types of the columns in the dataset
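The overview steps above can be sketched as follows. In the project you would call `pd.read_csv("ExtraaLearn.csv")`; since the actual file and data dictionary are not included here, the tiny inline CSV below is a stand-in (the column names `age`, `current_occupation`, `website_visits`, `time_spent_on_website`, and `status` are assumptions) so the sketch runs end to end.

```python
import io

import pandas as pd

# Stand-in for pd.read_csv("ExtraaLearn.csv") -- file name and columns are
# assumptions; replace with your copy of the leads data.
csv_text = """age,current_occupation,website_visits,time_spent_on_website,status
57,Professional,7,1639,1
36,Student,2,83,0
44,Unemployed,4,464,1
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first 5 rows
print(data.tail())   # last 5 rows
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # data type of each column
```

`data.info()` is a convenient alternative that also reports non-null counts and memory usage.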

Observations:

Exploratory Data Analysis

Univariate Analysis

Let's look at the summary statistics of the data and write observations based on them. (4 Marks)
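A minimal sketch of the summary-statistics step, using a small synthetic frame in place of the leads data (column names are assumptions):

```python
import pandas as pd

# Hypothetical numeric columns standing in for the leads data.
df = pd.DataFrame({
    "age": [57, 36, 44, 29, 61],
    "website_visits": [7, 2, 4, 11, 3],
})

# describe() reports count, mean, std, min, quartiles, and max per column;
# transposing makes wide datasets easier to scan.
summary = df.describe().T
print(summary)
```

Passing `include="all"` to `describe()` would also summarize the categorical columns (count, unique values, top value, frequency).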

Observations:__

Observations:

Let's check the distribution and outliers for numerical columns in the data

Provide observations for the distribution plots and box plots below.
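One way to produce a paired histogram and box plot for a numeric column is sketched below; the sample values are invented, and in the project this would loop over each numeric column of the dataset.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of one numeric column.
s = pd.Series([83, 120, 464, 600, 1639, 1750, 210, 95],
              name="time_spent_on_website")

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
ax_hist.hist(s, bins=5)                       # shape of the distribution
ax_hist.set_title(f"Distribution of {s.name}")
ax_box.boxplot(s, vert=False)                 # outliers appear past the whiskers
ax_box.set_title(f"Box plot of {s.name}")
fig.tight_layout()
fig.savefig("dist_and_box.png")
```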

Observations:___

Bivariate Analysis

We are done with univariate analysis and data preprocessing. Let's explore the data a bit more with bivariate analysis.

Leads will have different expectations from the outcome of the course and their current occupation may play a key role for them to take the program. Let's analyze it.
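A common way to compare a categorical column against conversion status is a row-normalized crosstab (optionally drawn as a stacked bar chart). The occupation labels and sample rows below are assumptions:

```python
import pandas as pd

# Hypothetical sample: occupation of each lead and final status (1 = converted).
df = pd.DataFrame({
    "current_occupation": ["Professional", "Student", "Professional",
                           "Unemployed", "Student", "Professional"],
    "status": [1, 0, 1, 1, 0, 0],
})

# normalize="index" turns counts into within-group proportions, i.e. the
# conversion rate inside each occupation group.
rates = pd.crosstab(df["current_occupation"], df["status"], normalize="index")
print(rates)
```

`rates.plot(kind="bar", stacked=True)` would render the same comparison graphically.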

Observations:

Age can also be a good factor to differentiate between such leads. Let's explore this.

Observations:

The company's first interaction with leads should be compelling and persuasive. Let's see if the channels of the first interaction have an impact on the conversion of leads.

Observations:

We observed earlier that some leads spend more time on websites than others. Let's analyze if spending more time on websites results in conversion.

Observations:___

People browsing the website or the mobile app are generally required to create a profile by sharing their details before they can access more information. Let's see if the profile completion level has an impact on lead conversion.

Observations:

Referrals from a converted lead can be a good source of new leads at a very low advertising cost. Let's see how referrals impact lead conversion status.

Observations:

Let's plot the correlation heatmap and write observations based on it.
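A sketch of the heatmap step on a synthetic frame (column names are assumptions). `sns.heatmap(corr, annot=True)` from seaborn is the usual one-liner; plain matplotlib works too:

```python
import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric columns standing in for the leads data.
df = pd.DataFrame({
    "age": [57, 36, 44, 29, 61, 50],
    "website_visits": [7, 2, 4, 11, 3, 6],
    "time_spent_on_website": [1639, 83, 464, 600, 210, 1200],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()
print(corr.round(2))

fig, ax = plt.subplots()
im = ax.matshow(corr.values)   # color-code the correlation matrix
fig.colorbar(im)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.savefig("corr_heatmap.png")
```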

Observations:__

Data preparation for modeling

Checking the shape of the train and test data
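The preparation step typically one-hot encodes the categorical columns and then splits the data. A minimal sketch on synthetic data (column names are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical leads data with one categorical feature.
df = pd.DataFrame({
    "age": range(20, 40),
    "current_occupation": ["Student", "Professional"] * 10,
    "status": [0, 1] * 10,
})

# One-hot encode categoricals; drop_first avoids a redundant dummy column.
X = pd.get_dummies(df.drop(columns="status"), drop_first=True)
y = df["status"]

# stratify=y keeps the class ratio identical in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```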

Building Classification Models

Before training the model, let's choose the appropriate model evaluation criterion as per the problem at hand.

Model evaluation criterion

The model can make wrong predictions in two ways:

  1. Predicting a lead will not be converted to a paid customer but, in reality, the lead would have converted to a paid customer.
  2. Predicting a lead will be converted to a paid customer but, in reality, the lead would not have converted to a paid customer.

Which case is more important?

Losing a potential customer is a greater loss for the organization.

How to reduce the losses?

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
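One possible shape for that helper, demonstrated with a trivial baseline classifier (the function name `report_scores` is a placeholder):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix


def report_scores(model, X, y):
    """Print the classification report and confusion matrix for model on (X, y)."""
    pred = model.predict(X)
    # zero_division=0 silences warnings when a class is never predicted.
    print(classification_report(y, pred, zero_division=0))
    cm = confusion_matrix(y, pred)
    print(cm)  # rows = actual class, columns = predicted class
    return cm


# Quick demonstration with a majority-class baseline.
X = np.array([[0], [0], [1]])
y = np.array([0, 0, 1])
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
cm = report_scores(clf, X, y)
```

The same call then works unchanged for the decision tree and random forest models below.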

Decision Tree

Let's check the performance on the training data
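A sketch of fitting an unconstrained tree and scoring it on the training set; `make_classification` stands in for the real `X_train`/`y_train` produced by the split above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split of the leads data.
X_train, y_train = make_classification(n_samples=200, random_state=1)

dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, y_train)

# An unconstrained tree grows until its leaves are pure, so it effectively
# memorizes the training set -- expect a perfect training recall here.
print(recall_score(y_train, dt.predict(X_train)))
```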

Observations:_

Let's check the performance on test data to see if the model is overfitting.

Observations:_

Let's try hyperparameter tuning using GridSearchCV to find the optimal max_depth to reduce overfitting of the model. We can tune some other hyperparameters as well.

Decision Tree - Hyperparameter Tuning

We will use the class_weight hyperparameter with the value {0: 0.3, 1: 0.7}, which is approximately the inverse of the class imbalance in the original data.

This tells the model that class 1 (converted) is the more important class here.
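A sketch of the search, again on synthetic stand-in data; the grid values are illustrative choices, and `scoring="recall"` reflects the goal of not losing potential customers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic stand-in (~70% class 0) for the training split.
X_train, y_train = make_classification(n_samples=200, weights=[0.7],
                                       random_state=1)

# class_weight={0: 0.3, 1: 0.7} penalizes missing the converted class more.
dt = DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, random_state=1)

param_grid = {
    "max_depth": [2, 3, 4, 5],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(dt, param_grid, scoring="recall", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
tuned_dt = grid.best_estimator_  # refit on the full training data
```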

We have tuned the model and fit the tuned model on the training data. Now, let's check the model performance on the training and testing data.

Observations:__

Let's check the model performance on the testing data

Observations:__

Let's visualize the tuned decision tree and observe the decision rules:

Make observations from the visualization of the tuned decision tree below.

Note: Blue leaves represent the converted leads, i.e., y[1], while the orange leaves represent the not converted leads, i.e., y[0]. Also, the darker a leaf's color, the purer it is, i.e., the larger the share of its observations that belong to a single class.
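The visualization itself comes from `sklearn.tree.plot_tree`. The sketch below fits a shallow tree on a bundled dataset as a stand-in for the tuned model; `filled=True` produces the orange/blue leaf coloring described in the note.

```python
import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Any fitted tree works; a shallow one keeps the decision rules readable.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

fig, ax = plt.subplots(figsize=(14, 8))
# filled=True colors each node by majority class and purity;
# feature_names labels every split rule with the column it tests.
annots = plot_tree(dt, feature_names=list(X.columns), filled=True,
                   fontsize=8, ax=ax)
fig.savefig("tuned_tree.png")
```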

Observations:_

Let's look at the feature importance of the tuned decision tree model
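A fitted tree exposes `feature_importances_` (impurity-based, summing to 1). Sketch, again on a bundled dataset standing in for the leads data:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Pair each importance with its column name and sort to surface the
# dominant predictors; the values sum to 1 across all features.
importances = pd.Series(dt.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances.head(10))
```

`importances.plot(kind="barh")` gives the usual horizontal bar chart.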

Observations:

Now, let's build another model - a random forest classifier.

Random Forest Classifier

Let's check the performance of the model on the training data
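A sketch of fitting a default forest and scoring it on the training split; the synthetic data stands in for the real split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in for the leads data.
X, y = make_classification(n_samples=400, weights=[0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1,
                                                    stratify=y)

# Default forest: 100 trees, each grown on a bootstrap sample with a
# random subset of features considered at every split.
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)

train_recall = recall_score(y_train, rf.predict(X_train))
print("Training recall:", train_recall)
```

Like the unconstrained decision tree, a default forest fits the training data almost perfectly, so the test-set score is the one that matters.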

Observations:__

Let's check the performance on the testing data

Observations:__

Let's see if we can get a better model by tuning the random forest classifier

Random Forest Classifier - Hyperparameter Tuning

Let's try tuning some of the important hyperparameters of the Random Forest Classifier.

We will not tune the criterion hyperparameter as we know from hyperparameter tuning for decision trees that entropy is a better splitting criterion for this data.

Observations:

Note: GridSearchCV can take a long time to run depending on the number of hyperparameters and the number of values tried for each hyperparameter. Therefore, we have reduced the number of values passed to each hyperparameter.

Note: The below code might take some time to run depending on your system's configuration.
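A compact version of that search is sketched below on synthetic stand-in data; the criterion is fixed to "entropy" as discussed above, the class weights mirror the decision-tree tuning, and the grids are deliberately small so the search finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic stand-in for the training split.
X_train, y_train = make_classification(n_samples=200, weights=[0.7],
                                       random_state=1)

# Small illustrative grids -- expand these on a faster machine.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 6],
    "max_features": ["sqrt", 0.5],
}

grid = GridSearchCV(
    RandomForestClassifier(criterion="entropy",
                           class_weight={0: 0.3, 1: 0.7},
                           random_state=1),
    param_grid, scoring="recall", cv=3, n_jobs=-1,
)
grid.fit(X_train, y_train)
tuned_rf = grid.best_estimator_
print(grid.best_params_)
```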

Let's check the performance of the tuned model

Observations:__

Let's check the model performance on the test data

Observations:___

One of the drawbacks of ensemble models is that we lose the ability to obtain an interpretation of the model. We cannot observe the decision rules for random forests the way we did for decision trees. So, let's just check the feature importance of the model.

Observations:

Conclusion and Recommendations

Write your conclusions on the key factors that drive lead conversion, along with your recommendations to the business on how it can improve the conversion rate. (10 Marks)

Conclusions:__

Business Recommendations:__