Predicting Customer Churn for OTT and SaaS

End-to-end churn analysis project using the Telco Customer Churn dataset. This project identifies churn drivers, predicts churn probability, and segments high-risk users for business action.

Dataset: Telco Customer Churn (Kaggle) | Models: Logistic Regression, Random Forest | Goal: Reduce churn through data-driven retention strategy
Tags: Python | EDA + Visualization | Churn Prediction

  • Customers analyzed: 7032
  • Overall churn rate: 26.58%
  • Logistic Regression accuracy: 0.7946
  • Random Forest accuracy: 0.7584
  • High-risk users: 19.80%

Key Findings

High Early Churn

New users churn at a higher rate than long-tenure users.

Price Sensitivity

Users with higher monthly charges are more likely to churn.

Support Signal

Frequent support interactions are associated with elevated churn risk. Support calls are an engineered proxy feature, not a field in the original dataset.

Project Objectives

  • Calculate churn rate across customer groups
  • Identify the key factors associated with churn
  • Build a prediction model for churn probability
  • Segment users into Low, Medium, and High risk
  • Recommend practical retention strategies
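
As a rough illustration of the first and fourth objectives, the sketch below computes a churn rate per customer group and buckets a churn probability into Low, Medium, and High risk. The `Contract` and `Churn` column names follow the public Telco CSV; the rows, the `churn_probability` column, and the 0.3 / 0.6 cut-offs are assumptions made for the example, not values taken from this project.

```python
import pandas as pd

# Toy rows for illustration only; not real project data.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "Two year"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No"],
    # Hypothetical model output; the project may name this differently.
    "churn_probability": [0.82, 0.35, 0.22, 0.05, 0.64, 0.10],
})

# Churn rate per group is the mean of a 0/1 churn flag.
df["churned"] = (df["Churn"] == "Yes").astype(int)
churn_by_contract = df.groupby("Contract")["churned"].mean()

# Assumed thresholds: <= 0.3 Low, <= 0.6 Medium, otherwise High.
df["risk_segment"] = pd.cut(
    df["churn_probability"],
    bins=[0.0, 0.3, 0.6, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

print(churn_by_contract)
print(df["risk_segment"].value_counts())
```

With these toy rows, two of the three month-to-month customers churn, so that group's rate is 2/3. In practice the thresholds would be tuned to the retention budget rather than fixed in advance.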

Workflow

  1. Data loading and validation from the source CSV
  2. Data cleaning and preprocessing
  3. Feature engineering for engagement and risk signals
  4. Exploratory analysis and correlation study
  5. Model training and evaluation
  6. Risk segmentation and business recommendation generation
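
Steps 1, 2, and 5 above might look roughly like the following. This is a minimal sketch, not the code in `subscription_churn.py`: it runs on synthetic data so it works without the CSV, and the feature list, the 80/20 split, and the hyperparameters are all assumptions. The blank `TotalCharges` strings mimic the kind of value that has to be coerced and dropped during cleaning.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the source CSV (assumption: three numeric features).
rng = np.random.default_rng(42)
n = 500
tenure = rng.integers(0, 73, size=n)
tenure[:5] = 0  # guarantee a few zero-tenure rows
monthly = rng.uniform(20, 120, size=n).round(2)
# Make churn likelier for short tenure and high monthly charges.
logit = 0.03 * (monthly - 70) - 0.05 * (tenure - 30)
churn = np.where(rng.random(n) < 1 / (1 + np.exp(-logit)), "Yes", "No")
total = (tenure * monthly).round(2).astype(str)
total[tenure == 0] = " "  # blank strings instead of numbers

raw = pd.DataFrame({"tenure": tenure, "MonthlyCharges": monthly,
                    "TotalCharges": total, "Churn": churn})

# Cleaning: coerce blanks to NaN, drop them, encode the target as 0/1.
raw["TotalCharges"] = pd.to_numeric(raw["TotalCharges"], errors="coerce")
clean = raw.dropna(subset=["TotalCharges"]).copy()
clean["churned"] = (clean["Churn"] == "Yes").astype(int)

X = clean[["tenure", "MonthlyCharges", "TotalCharges"]]
y = clean["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train both model families named in this README and compare accuracy.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.4f}")
```

Because roughly a quarter of customers churn, accuracy alone can flatter a model; comparing against the majority-class baseline, or also reporting recall on churners, gives a fairer picture.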

Open Chart Analysis to view all generated charts with explanations.

Installation Steps

  1. Clone the repository:
    git clone https://github.com/bikram73/Subscription_Churn_Analysis.git
  2. Move into the project folder:
    cd Subscription_Churn_Analysis
  3. Create a virtual environment:
    python -m venv .venv
  4. Activate the environment (Windows PowerShell):
    .venv\Scripts\Activate.ps1
    On macOS or Linux, run this instead:
    source .venv/bin/activate
  5. Install dependencies:
    pip install -r requirements.txt
  6. Run the pipeline:
    python subscription_churn.py

Tech Stack

Data Processing

Pandas, NumPy

Visualization

Matplotlib, Seaborn

Machine Learning

Scikit-learn

Note: The original Telco dataset does not include OTT telemetry fields such as usage frequency, days since last login, or support calls. The project engineers these as deterministic proxy features.
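
To make "deterministic proxy features" concrete, here is one way such columns could be derived from fields the Telco CSV does contain. The formulas, the helper name `add_proxy_features`, and the output column names are hypothetical and chosen only for illustration; the project's actual derivation may differ. What matters is that there is no randomness, so the same input always yields the same features.

```python
import pandas as pd

# Toy rows using real Telco column names; the values are made up.
df = pd.DataFrame({
    "tenure": [1, 12, 48, 70],
    "MonthlyCharges": [95.0, 70.5, 45.0, 25.0],
    "TechSupport": ["No", "No", "Yes", "Yes"],
    "StreamingTV": ["Yes", "No", "Yes", "No"],
    "StreamingMovies": ["Yes", "No", "No", "No"],
})

def add_proxy_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical proxies: pure functions of existing columns."""
    out = frame.copy()
    streaming = ((out["StreamingTV"] == "Yes").astype(int)
                 + (out["StreamingMovies"] == "Yes").astype(int))
    # More streaming add-ons and longer tenure imply more frequent usage.
    out["usage_frequency"] = 2 + 3 * streaming + out["tenure"] // 24
    # Fewer streaming add-ons imply more days since the last login.
    out["last_login_days"] = (30 - 10 * streaming).clip(lower=1)
    # No tech-support add-on and a high bill imply more support calls.
    out["support_calls"] = (2 * (out["TechSupport"] == "No").astype(int)
                            + (out["MonthlyCharges"] > 80).astype(int))
    return out

features = add_proxy_features(df)
print(features[["usage_frequency", "last_login_days", "support_calls"]])
```

Because proxies like these are built from the same columns the model already sees, any finding that rests on them (for example the support signal) describes the engineered rule as much as real customer behavior.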