Customer Churn Prediction with Machine Learning

An end-to-end pipeline: data preparation, exploratory analysis, modeling, evaluation, and business integration.

Figure 1 — Conceptual overview of churn scoring and BI consumption (ROC curve shown).

Customer churn erodes recurring revenue. This case study demonstrates how we built an interpretable machine-learning workflow to identify at-risk customers early and surface drivers behind churn, enabling proactive retention campaigns.

Dataset

We used a public telecom churn dataset with demographics, account, and service usage information. The target label is Churn (Yes/No). The data is categorical-heavy (e.g., Contract, InternetService) with a churn rate of ~25–30%.
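A minimal loading sketch, assuming the standard Telco CSV layout; the filename below is a placeholder, and the hold-out split feeds the modeling and evaluation snippets later on:

# Load the data and encode the target (filename is illustrative)
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("telco_churn.csv")

# TotalCharges can parse as text because of blank entries; coerce to numeric
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Binary target: 1 = churned, 0 = retained
y = df["Churn"].map({"Yes": 1, "No": 0})
X = df.drop(columns=["Churn"])
print(f"Churn rate: {y.mean():.1%}")

# Stratified hold-out split used by the snippets below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)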

Data Preparation

# Sketch of preprocessing: impute/scale numerics, one-hot encode categoricals
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = ["Contract", "InternetService", "PaymentMethod", "PaperlessBilling"]

preprocess = ColumnTransformer(
    transformers=[
        # mean-impute and standardize the numeric columns
        ("num", Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())]), num_cols),
        # one-hot encode categoricals; unseen categories are ignored at scoring time
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)
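
One design choice worth noting: handle_unknown="ignore" encodes any category that never appeared in training as all zeros at scoring time rather than raising an error, which keeps batch scoring of new customers robust.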

Exploratory Analysis (EDA)

Churn was markedly higher among month-to-month contracts and low-tenure customers. Higher MonthlyCharges was associated with a modest increase in churn propensity, and customers on paperless billing churned at a higher rate than those without it.

Figure 2 — EDA views: churn by contract type, tenure distribution, and monthly charges.
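
These views can be reproduced with a few aggregations over the columns named above; a sketch, assuming the df from the loading snippet:

# Churn rate by contract type
print(df.groupby("Contract")["Churn"].apply(lambda s: (s == "Yes").mean()).sort_values(ascending=False))

# Median tenure and monthly charges, split by churn status
print(df.groupby("Churn")[["tenure", "MonthlyCharges"]].median())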

Modeling

We compared baseline and tree-based models, then tuned XGBoost:

# Sketch of pipeline & tuning
model = XGBClassifier(eval_metric="logloss", scale_pos_weight=pos_weight)
clf = Pipeline([("prep", preprocess), ("model", model)])

params = {
    "model__n_estimators": [200, 400],
    "model__max_depth": [4, 6],
    "model__learning_rate": [0.05, 0.1]
}
grid = GridSearchCV(clf, param_grid=params, scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
best = grid.best_estimator_

Evaluation

ROC-AUC: 0.86 · Accuracy: 82% · Recall (Churn): High

Threshold selection favored recall to reduce missed churners. Feature importance highlighted Contract, Tenure, and MonthlyCharges as key drivers. SHAP can be added for instance-level explanations.
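
The sketch below shows one way to implement the recall-oriented threshold choice and to pull global importances from the tuned pipeline; the 0.70 recall floor is an illustrative value rather than the figure used in the project, and a recent scikit-learn is assumed for get_feature_names_out:

from sklearn.metrics import precision_recall_curve, roc_auc_score

proba = best.predict_proba(X_test)[:, 1]
print("ROC-AUC:", roc_auc_score(y_test, proba))

# Highest threshold that still keeps recall at or above the chosen floor (0.70 here)
precision, recall, thresholds = precision_recall_curve(y_test, proba)
keep = recall[:-1] >= 0.70
threshold = thresholds[keep].max() if keep.any() else 0.5
print(f"Chosen threshold: {threshold:.2f}")

# Global importances from the fitted XGBoost step, mapped back to feature names
feature_names = best.named_steps["prep"].get_feature_names_out()
importances = best.named_steps["model"].feature_importances_
top_drivers = sorted(zip(feature_names, importances), key=lambda t: -t[1])[:10]
print(top_drivers)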

Figure 3 — Feature importance (tree-based model).

Scoring Output

We produce a BI/CRM-ready table with probabilities, risk bands, and top factors, enabling targeted retention:

CustomerID, ChurnProbability, RiskBand, TopFactors
7590-VHVEG, 0.81, High, Contract:Month-to-month; Tenure:Low; MonthlyCharges:High
5575-GNVDE, 0.65, Medium, Contract:Month-to-month; PaperlessBilling:Yes
3668-QPYBK, 0.22, Low, Contract:One year; Tenure:High
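
A minimal sketch of assembling this table from the tuned pipeline; the risk-band cutoffs are illustrative, the customerID column name is an assumption about the source data, and the per-customer TopFactors column would come from the SHAP step mentioned above:

import pandas as pd

scores = pd.DataFrame({
    "CustomerID": X_test["customerID"].to_numpy(),  # assumed ID column name
    "ChurnProbability": best.predict_proba(X_test)[:, 1].round(2),
})

# Risk bands from illustrative probability cutoffs
scores["RiskBand"] = pd.cut(
    scores["ChurnProbability"],
    bins=[0.0, 0.4, 0.7, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

scores.to_csv("churn_scores.csv", index=False)  # picked up by Power BI / the CRM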

Tech Stack

Python · Pandas · Scikit-learn · XGBoost · Matplotlib · (Optional) SHAP · Power BI (consumption)

Next Steps

