SyriaTel Customer Churn Analysis
An in-depth walk-through of a classification model project
Overview
This article will walk through the process of building a classification model to predict customer churn for SyriaTel, a telecommunications company. Customer churn can be defined as the rate at which customers stop doing business with a company. The analysis will follow the Cross-Industry Standard Process for Data Mining (CRISP-DM). We will first analyze the SyriaTel dataset in order to gain an understanding of the business. Next, the data will be cleaned and processed in preparation for the modeling phase. Classification models will then be built using the data, and finally the models will be tuned in order to optimize performance.
The two main evaluation metrics for this project will be Precision and Area Under the Curve score (AUC). Precision was chosen as the primary metric because it measures how many predicted positives are actual positives; since correctly flagging customers who will churn while avoiding false positives is the primary goal of these models, precision makes sense for the purposes of this project. AUC is also relevant to the business problem because a higher AUC score indicates a better trade-off between the true positive and false positive rates, and better overall model performance.
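As a minimal sketch of how these two metrics are computed with scikit-learn (the toy labels and probabilities below are illustrative, not taken from the SyriaTel data):

```python
from sklearn.metrics import precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]               # actual churn labels (1 = churned)
y_pred = [0, 1, 1, 1, 0, 0]               # hard class predictions
y_prob = [0.2, 0.6, 0.9, 0.7, 0.1, 0.4]   # predicted churn probabilities

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
auc = roc_auc_score(y_true, y_prob)          # area under the ROC curve
```

Note that precision is computed from hard class predictions, while AUC is computed from the predicted probabilities, which is why both are pulled from the model later on.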
Business Objective/Methods
SyriaTel is interested in knowing the most important factors in determining whether a customer will stay with the company. The primary goal of this project is to help SyriaTel keep current customers and offer business strategies on how to do so based on the analysis provided.
Using customer account data, this notebook will analyze which features are most important in predicting customer churn, i.e. whether a customer will leave the company. To do this, Logistic Regression, Decision Tree, and Random Forest models will be built, aimed at producing the highest possible Precision and AUC metrics, as previously stated.
Data Description
The data for this project is found at https://www.kaggle.com/becksddf/churn-in-telecoms-dataset. This dataset contains SyriaTel customer account data consisting of 3,333 rows and 21 columns. The data does not contain any missing values.


Exploratory Analysis
In beginning the exploratory analysis of the data we will first check the distributions of each feature using histograms and count plots. One main takeaway from this analysis is that the dataset is very imbalanced with regard to our dependent variable of ‘Churn’. About 86% of customers in this data are “False Churn”, while only about 14% of customers are “True Churn” customers. This finding will need to be addressed before building models. Left unaddressed, class imbalance would be a problem because a model that predicted “False Churn” every time would be correct 86% of the time while never identifying a single churning customer, which we obviously want to avoid.
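The imbalance check itself is a one-liner; a minimal sketch using a hypothetical stand-in for the churn column (constructed here to mirror the 86/14 split described above):

```python
import pandas as pd

# Toy stand-in for the real 'churn' column: 86% non-churn, 14% churn
churn = pd.Series([False] * 86 + [True] * 14, name="churn")

# Normalized value counts give the class proportions directly
share = churn.value_counts(normalize=True)
print(share)  # ~0.86 False, ~0.14 True
```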

Next, we will check for multicollinearity between the features using a correlation heat map and avoid running models with any highly correlated features. We will also engineer a new feature called “Total Charge”, the sum of all the “Charge” features, as we expect charge to play a strong role in predicting whether a customer leaves the company, as seen from the findings in the pivot table below. Finally, we will split the data into training and testing sets, label encode our dependent variable (Churn), and one-hot encode the categorical features in the data.
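The feature engineering and encoding steps can be sketched as follows. This is a hypothetical two-row frame standing in for the real dataset (the column names mirror the Kaggle dataset's per-period charge columns, but the values here are made up):

```python
import pandas as pd

# Miniature stand-in for the SyriaTel frame
df = pd.DataFrame({
    "total day charge": [30.0, 25.5],
    "total eve charge": [17.1, 16.0],
    "total night charge": [9.0, 8.5],
    "total intl charge": [2.7, 3.0],
    "international plan": ["no", "yes"],
    "churn": [False, True],
})

# Engineered feature: "Total Charge" as the sum of the four charge columns
charge_cols = [c for c in df.columns if c.endswith("charge")]
df["total charge"] = df[charge_cols].sum(axis=1)

# Label encode the target, one-hot encode the categorical feature
y = df["churn"].astype(int)
X = pd.get_dummies(df.drop(columns="churn"), columns=["international plan"])
```

In the real notebook the train/test split would happen before any fitting (e.g. with `sklearn.model_selection.train_test_split`), so that the engineered and encoded features are evaluated on held-out data.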






Model A (Logistic Regression) Training
Now that the data has been explored and preprocessed, the first model we will build will be a Logistic Regression model from the scikit-learn library. This will be a basic “vanilla” model in order to get a feel for how our data performs. We will first be sure to scale our data and set the class weight to “balanced”, in order to address the class imbalance problem mentioned previously.
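A minimal sketch of this baseline, using synthetic imbalanced data as a stand-in for the SyriaTel features (the real notebook would fit the preprocessed frame from above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with ~14% positives, mirroring the churn rate
X, y = make_classification(n_samples=1000, weights=[0.86], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Pipeline: scale features, then fit a class-weighted logistic regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)

prec = precision_score(y_te, model.predict(X_te))
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Wrapping the scaler and classifier in one pipeline ensures the scaler is fit only on the training split, avoiding leakage into the test metrics.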
Model A Summary/Evaluation



Model A top features:
1. Customer Service Calls
2. Total Charge
3. International Plan
- Precision Score (test data): 40%
- According to the testing-data classification report, Model A shows a 40% precision score for true churn. This indicates a weak model based on our main evaluation metric of precision.
- Area under the curve (test data): 84%
- Model A does not appear to be overfitting based on the similar results of metrics from the training and testing datasets.
Model A-2 (Logistic Regression w/ Tuned Parameters) Training
For our second model, Model A-2, we will again run a logistic regression model, except this time we will use Grid Search CV to tune the model parameters and find the best combination. We will also reduce the number of features to the top four features from Model A to see if this helps improve the model’s performance.
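The tuning step can be sketched with `GridSearchCV` over the same pipeline; the grid below is a hypothetical example over the regularization strength `C`, not the exact grid used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=600, weights=[0.86], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Illustrative grid; pipeline step names prefix the parameter keys
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, param_grid, scoring="precision", cv=5)
grid.fit(X_tr, y_tr)

best_C = grid.best_params_["logisticregression__C"]
```

Setting `scoring="precision"` makes the search optimize the project's primary metric rather than the default accuracy.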

Model A-2 Summary/Evaluation


Model A-2 top features:
1. Customer Service Calls
2. Total Charge
3. International Plan (no)
- Precision Score (test data): 39% (Down 1% from Model A)
- Area under the curve (test data): 84% (Same as Model A)
- Model A2 does not appear to be overfitting based on the similar results of metrics from the training and testing datasets.
- Because reducing features did not improve the precision score, we will likely include all features from Model A in future models.
Model B Training
For our next model we will run a basic vanilla Decision Tree classifier in order to see how this model type compares to our logistic regression models.
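A minimal sketch on synthetic stand-in data, which also demonstrates the overfitting tendency of an unconstrained tree that the evaluation below discusses:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=600, weights=[0.86], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Vanilla tree: no depth limit, so it can keep splitting until
# every training sample is classified perfectly
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # 1.0 — the tree memorizes the training set
test_acc = tree.score(X_te, y_te)   # noticeably lower on held-out data
```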
Model B Summary/Evaluation

Model B top features:
1. Total Charge
2. Customer Service Calls
3. Number Voice Mail Messages
- Precision Score (training data): 100% (Up 61% from Model A2)
- Area under the curve (test data): 88% (Up 4% from Model A2)
- Model B is overfitting based on its perfect scores for the training data. The model is likely making too many splits, and parameters will need to be adjusted going forward in order to produce a more reliable model, particularly the “max depth” parameter.
Model B-2 Training
For Model B-2 we will again use a decision tree model, except this time we will first run a Grid Search in order to find the best parameter combinations. This should address the overfitting problem seen in the previous model.
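A sketch of the tuned tree on synthetic stand-in data; the grid values below are illustrative, not the exact ones searched in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=600, weights=[0.86], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# Constraining depth and leaf size is what curbs the overfitting seen in Model B
param_grid = {"max_depth": [2, 3, 5, 10], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=2), param_grid,
                    scoring="precision", cv=5)
grid.fit(X_tr, y_tr)

best_tree = grid.best_estimator_
```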
Model B-2 Summary/Evaluation


Model B-2 top features:
1. Total Charge
2. Customer Service Calls
3. Voice Mail plan (no)
- Precision Score (test data): 93% (Highest precision score so far)
- Area under the curve (test data): 89% (Highest AUC score so far)
- Model B2 does not appear to be overfitting based on similar training and testing results.
- As expected, running a Grid Search and tuning parameters has greatly improved the performance of the decision tree model, producing our best model thus far.
Model C Training
Our next model will be a Random Forest vanilla classifier without running a Grid Search.
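A minimal vanilla Random Forest sketch on synthetic stand-in data, including how a “top features” ranking like the one below can be read off the fitted model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=400, weights=[0.86], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)

# Vanilla forest with default parameters
rf = RandomForestClassifier(random_state=3).fit(X_tr, y_tr)

# Impurity-based importances sum to 1; the largest values rank the top features
importances = rf.feature_importances_
top3 = np.argsort(importances)[::-1][:3]
```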
Model C Summary/Evaluation

Model C top features:
1. Total Charge
2. Customer Service Calls
3. Total International Minutes
- Precision Score (test data): 97% (Highest precision score so far)
- Area under the curve (test data): 89% (Highest AUC score, tied with Model B2)
- Model C appears to be overfitting based on its perfect training data results. For our final model we will again adjust parameters by running a Grid Search, which should address the overfitting issue seen in Model C.
Model C-2 Training
Our final Model will be a Random Forest with tuned parameters using a Grid Search.
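A sketch of the tuned forest on synthetic stand-in data; the grid values are illustrative and kept small here, not the exact ones searched in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced stand-in data
X, y = make_classification(n_samples=400, weights=[0.86], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

# Illustrative grid; limiting tree depth is again the main overfitting control
param_grid = {"n_estimators": [50, 150], "max_depth": [5, 10, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=4), param_grid,
                    scoring="precision", cv=3)
grid.fit(X_tr, y_tr)

best_rf = grid.best_estimator_
```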
Model C-2 Summary/Evaluation

Model C-2 top features:
- According to Model C-2, the top features in predicting churn are:
1. Total Charge
2. Customer Service Calls
3. Total International Minutes
- Precision Score (test data): 100% (Highest precision score of all models)
- Area under the curve (test data): 89% (Highest AUC score, tied with Model B2 and Model C)
- Model C2 does not appear to be overfitting based on similar training and testing results.
- Because Model C2 features the highest precision score and AUC (our main evaluation metrics), we will use it as the primary model for making business recommendations.

Business Recommendations/Conclusion
Based on the overall analysis I would recommend the following three business strategies to SyriaTel in order to prevent losing customers to churn in the future.
1. The most important factor in losing customers to churn, according to the analysis, is a customer having a high ‘Total Charge’. This makes sense, as customers who are paying the most would be the likeliest to become unhappy with their service and seek out an alternative provider. One solution could be to monitor customers’ accounts and create a promotion or discount for customers with the highest total charge, in order to prevent them from canceling their service due to high costs in the future.
2. The second most important factor in losing customers to churn appears to be Customer Service Calls. The more a customer calls customer service, the more likely they are to cancel their service. One way to prevent this would be to monitor accounts and flag when customer service calls are above average. A promotion or discounted rate could then be offered to these customers in order to prevent them from canceling their service. Another, cheaper solution could be to preemptively reach out to these customers in order to resolve whatever issue keeps them calling customer service. Finally, a third option would be to improve the customer service department overall, thus improving the customer experience and preventing customer service issues before they become a problem.
3. The third most important factor in losing customers to churn appears to be ‘Total International Minutes’. The more total international minutes a customer has, the more likely they are to cancel their service. One solution to this could be to offer a better international plan to these customers in order to prevent them from seeking out an alternative provider. A promotion such as this could also potentially persuade international customers from other service providers to switch to SyriaTel, thus increasing the customer base even further.
Future Work
1. In order to improve the models further, we will likely need to test them on larger customer datasets to see if they remain accurate across a larger customer sample.
2. We could also test the models on competitors’ datasets to see if the results are similar across different wireless companies within the industry.
3. Finally, we could build several more unique models, as well as tune more model parameters, to see if any further conclusions can be reached.