Customer Churn Prediction with Machine Learning

An end-to-end pipeline: data preparation, exploratory analysis, modeling, evaluation, and business integration.

Arc Innovations · 2025 · Python · Scikit-learn · XGBoost · Pandas

Customer churn model ROC curve and evaluation overview — Figure 1 — Conceptual overview of churn scoring and BI consumption (ROC shown).

Customer churn erodes recurring revenue. This case study demonstrates how we built an interpretable machine-learning workflow to identify at-risk customers early and surface drivers behind churn, enabling proactive retention campaigns.

Dataset

We used a public telecom churn dataset with demographics, account, and service usage information. The target label is Churn (Yes/No). The data is categorical-heavy (e.g., Contract, InternetService) with a churn rate of ~25–30%.

Records: ~7,043 · Features: 21
Split: 80/20 train–test with stratification
Imbalance handled via class weights / thresholding

Data Preparation

Standardized binary fields (Yes/No → 1/0), coerced numeric types (e.g., TotalCharges).
One-hot encoding for multi-class categoricals (Contract, InternetService).
Scaled numeric features (tenure, MonthlyCharges).
Stratified split to preserve churn ratio across train–test.

# Sketch of preprocessing
num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService", "PaymentMethod", "PaperlessBilling"]

# impute/scale numerics, one-hot encode categoricals
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())]), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
    ]
)

Exploratory Analysis (EDA)

We observed higher churn among month-to-month contracts and low-tenure customers. Increased MonthlyCharges slightly raised churn propensity. Paperless billing showed notable patterns.

EDA bar charts and distributions — Figure 2 — EDA views: churn by contract type, tenure distribution, and monthly charges.

Modeling

We compared baseline and tree-based models, then tuned XGBoost:

Baselines: Logistic Regression, Random Forest
Best: XGBoost with class weighting & hyperparameter search (depth, estimators, learning rate)
Metrics: Accuracy, Precision/Recall/F1, ROC-AUC; business goal prioritized recall

# Sketch of pipeline & tuning
model = XGBClassifier(eval_metric="logloss", scale_pos_weight=pos_weight)
clf = Pipeline([("prep", preprocess), ("model", model)])

params = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [4, 6],
    "model__learning_rate": [0.05, 0.1]
}
grid = GridSearchCV(clf, param_grid=params, scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
best = grid.best_estimator_

Evaluation

0.86

ROC-AUC

82%

Accuracy

High

Recall (Churn)

Threshold selection favored recall to reduce missed churners. Feature importance highlighted Contract, Tenure, and MonthlyCharges as key drivers. SHAP can be added for instance-level explanations.

Feature importance bar chart — Figure 3 — Feature importance (tree-based model).

Scoring Output

We produce a BI/CRM-ready table with probabilities, risk bands, and top factors, enabling targeted retention:

CustomerID, ChurnProbability, RiskBand, TopFactors
7590-VHVEG, 0.81, High, Contract:Month-to-month; Tenure:Low; MonthlyCharges:High
5575-GNVDE, 0.65, Medium, Contract:Month-to-month; PaperlessBilling:Yes
3668-QPYBK, 0.22, Low, Contract:One year; Tenure:High

Tech Stack

Python · Pandas · Scikit-learn · XGBoost · Matplotlib · (Optional) SHAP · Power BI (consumption)

Next Steps

Deploy as a lightweight API for real-time scoring.
Schedule monthly retraining & drift monitoring.
Integrate with CRM to automate retention workflows.

← Back to Portfolio Discuss a similar project