Churn Predictor Project
Title: Churn Predictor
Start Date: 2024-09-18
Issue Date: 2024-09-30
Summary
The Churn Predictor project aimed to forecast customer churn using a machine learning approach. Customer churn refers to customers ceasing to use a company’s service or product; accurate prediction enables businesses to take preventive measures to retain at-risk customers.
We utilized three classification algorithms:
- XGBoost (Extreme Gradient Boosting)
- Random Forest
- Logistic Regression
The dataset, containing 7,043 customer records, provided insights into customer demographics, service usage, and behavioral history, serving as features for the churn prediction.
Technical Approach for Churn Predictor
Data Preparation
- Categorical Conversion: Categorical features (e.g., gender, type of service) were converted to numerical values, since these algorithms require numerical input.
- Feature Selection: Columns deemed less relevant were dropped to streamline the dataset and enhance model performance.
- Model Evaluation Strategy: For each iteration, 100 random rows were chosen as the test dataset, while the remainder served as training data. This process was repeated 10 times, with each iteration using a different set of 100 test rows to ensure a robust performance evaluation. The final model accuracy was averaged across these 10 runs.
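The repeated random-holdout strategy above can be sketched as follows. This is an illustrative implementation, not the project’s actual code; the function name and default arguments are assumptions.

```python
# Sketch of the repeated random-holdout evaluation: 100 random test rows,
# 10 iterations, accuracy averaged across runs. Illustrative, not the
# project's actual code.
import numpy as np
from sklearn.metrics import accuracy_score

def repeated_holdout_accuracy(X, y, model, n_runs=10, test_size=100, seed=0):
    """Average accuracy over n_runs random holdouts of test_size rows each."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        # Pick a fresh random set of test rows each iteration.
        test_idx = rng.choice(len(X), size=test_size, replace=False)
        mask = np.zeros(len(X), dtype=bool)
        mask[test_idx] = True
        # Train on the remaining rows, score on the held-out rows.
        model.fit(X[~mask], y[~mask])
        scores.append(accuracy_score(y[mask], model.predict(X[mask])))
    return float(np.mean(scores))
```

Averaging over several random splits reduces the variance a single 100-row test set would introduce.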
Algorithm Overview
- XGBoost
- Description: An efficient ensemble model that builds decision trees sequentially, each new tree aiming to correct the errors of the previous ones.
- Strength: High accuracy and scalability, often performing well in complex data settings.
- Random Forest
- Description: Uses an ensemble of numerous decision trees, where each tree is trained on a random subset of data and features.
- Strength: Robust against overfitting and generalizes well.
- Logistic Regression
- Description: A linear model for binary classification, modeling the probability of churn based on input features.
- Strength: Simple, interpretable, and particularly suitable for binary outcome prediction.
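The three classifiers can be set up with a shared fit/predict interface, as sketched below. Hyperparameters here are illustrative defaults, not the project’s actual settings, and scikit-learn’s GradientBoostingClassifier stands in for XGBoost (the project itself would use `xgboost.XGBClassifier`, which exposes a compatible API).

```python
# Illustrative setup of the three classifiers compared in this project.
# GradientBoostingClassifier stands in for XGBoost (same boosted-tree idea);
# hyperparameters are example defaults, not the project's actual settings.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),  # XGBoost stand-in
}
```

Because all three share the scikit-learn estimator interface, the same evaluation loop can score each model without per-model code.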
Results and Evaluation
After evaluating each model’s accuracy:
- Logistic Regression achieved the highest accuracy at 80.8%.
- XGBoost followed closely with an accuracy of 79.9%.
- Random Forest had a slightly lower accuracy at 79.3%.
The binary target and relatively simple feature set favored Logistic Regression, which outperformed the more complex XGBoost and Random Forest models that are typically expected to excel on more intricate datasets.
Improvement Areas in Churn Predictor
To further enhance model performance, I suggest the following:
- Increased Data and Features: Expanding the dataset with additional demographic or behavioral attributes could improve model generalization and accuracy.
- Categorical Grouping: Rather than dropping categorical columns, we could map these categories to numerical values, retaining more of the data’s contextual information.
- Advanced Feature Engineering: Exploring advanced techniques to uncover latent patterns within the data could improve prediction accuracy.
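The categorical-grouping idea above can be sketched with pandas: instead of dropping a categorical column, map its categories to integer codes. The column name and values here are hypothetical examples, not fields from the actual dataset.

```python
# Illustrative: keep a categorical column by mapping categories to integer
# codes instead of dropping it. Column name and values are hypothetical.
import pandas as pd

df = pd.DataFrame(
    {"contract_type": ["month-to-month", "one-year", "two-year", "one-year"]}
)
# .cat.codes assigns one integer per category (alphabetical order by default).
df["contract_type_code"] = df["contract_type"].astype("category").cat.codes
```

A one-hot encoding (`pd.get_dummies`) is an alternative when the categories have no natural order, since integer codes can imply an ordering that tree models tolerate but linear models may misread.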
Conclusion
This project demonstrated that Logistic Regression was the most effective model for churn prediction due to the dataset’s simplicity and the binary nature of the target variable. Although XGBoost and Random Forest performed closely, Logistic Regression proved to be the most straightforward and accurate choice for this project. With more data and refinement, future iterations could further enhance accuracy, providing deeper insights into customer behavior patterns.
My GitHub Repository: Project’s Page