Rare Event Calibration Lab
You have 10,000 customers. Only 5% will churn. A naive model learns that saying "won't churn" is correct 95% of the time — so it stops identifying churners at all. Oversampling fixes the learning problem. But it quietly breaks the probability scale. This lab shows you both effects — and the one-line fix.
OVERVIEW & LEARNING OBJECTIVES
The Setup: Imagine you're building a churn model. 10,000 customers. 500 will actually churn (5%). The other 9,500 won't. You train a logistic regression on this data and it learns a dark secret: if it just predicts "won't churn" for every single customer, it's correct 95% of the time. Great accuracy. Useless model. It will never alert you to a churner because it learned that ignoring them is the safe bet.
The Fix — and its hidden cost: Oversampling artificially boosts the churn cases in training data — say, from 5% to 50% — so the model sees enough examples to actually learn what a churner looks like. This genuinely works: the model's ability to rank high-risk customers improves dramatically. But now the model thinks it lives in a world where 50% of people churn. Its raw probability scores are inflated 5–15× compared to reality. If you use those scores directly for budget decisions, you'll massively overspend.
The Correction: A one-line algebraic formula (King & Zeng, 2001) scales every raw score back to the true population rate. No retraining. No new data. You get the discrimination improvement and trustworthy probabilities.
📋 Step-by-Step: How to Use This Tool
Click Run Simulation below. Four models will train on the same 10,000-observation dataset — one with no oversampling, and three with progressively aggressive oversampling (20%, 35%, 50% positive rate in training).
AUC (0 to 1) measures how well the model rank-orders customers by risk. AUC = 0.50 is a coin flip — useless. AUC = 0.90 means a truly risky customer outranks a safe one 90% of the time. Watch how much AUC rises as oversampling increases. This is the win.
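The rank-ordering interpretation of AUC given above can be computed directly: AUC is the probability that a randomly chosen positive outranks a randomly chosen negative, with ties counting half. A minimal numpy sketch (function name and toy scores are illustrative, not part of the tool):

```python
import numpy as np

def auc_rank(scores, labels):
    """AUC as P(random positive outranks random negative); ties count half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Compare every positive score against every negative score
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0, 0])
perfect = auc_rank([0.9, 0.8, 0.3, 0.2, 0.1], labels)  # positives always outrank
random_ = auc_rank([0.5, 0.5, 0.5, 0.5, 0.5], labels)  # all tied: coin flip
print(perfect, random_)  # 1.0 0.5
```

Note that the scores only need to be in the right order; their absolute values never enter the AUC. That is exactly why AUC cannot detect the calibration problem this lab demonstrates.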
The true churn rate is 5%. Mean Pred % is what the model thinks the average customer's probability is. For oversampled models this number will be 3–10× higher than 5%. The "Inflation" chip shows the exact multiple. A 4× inflation means the probabilities are roughly four times too high.
Enable Show King-Zeng Corrected Probabilities above the condition cards. The Mean Pred % will snap back to ~5%. The mini-charts will shift back toward the diagonal. But AUC will not change at all — the correction rescales probabilities but preserves rank order perfectly.
🎯 What You'll Learn
- Why "natural" models fail on rare events: The model sees 1 churner per 19 non-churners. It learns that predicting "no" is almost always correct — earning high accuracy but near-random ability to find actual churners. AUC near 0.5 means barely better than a coin flip. Check the Natural Sampling card after running — this is why oversampling exists.
- What oversampling fixes (and what it doesn't): Oversampling dramatically improves AUC — how well the model rank-orders customers by risk. It does NOT fix probability estimates — in fact, it makes them dangerously wrong.
- What "calibration" means in plain English: If the model says a customer has a 10% churn probability, are roughly 10% of those customers actually churning? Calibration measures truthfulness of the probability number itself — not just whether it ranks customers in the right order.
- When the probability number matters: Rank-ordering only? (Top 20% of customers for a calling campaign) → AUC is all you need. Dollar-value calculations? (Bid = p × value, CLV = p × margin × tenure) → You need calibration too. A 5× inflated probability → 5× miscalculated bids.
- The King-Zeng correction: One formula. No retraining. Rescales every raw score back to the true population rate by accounting for the ratio between training positive rate and true positive rate.
💡 Concrete dollar example: Programmatic ad bidding: Bid = p(click) × revenue_per_click. If your model outputs p = 0.40 but truth is p = 0.05, you bid 8× too much per impression. On a $100K monthly budget, that's roughly $87,500 wasted on overpriced inventory. Calibration is a financial requirement, not just a statistical nicety.
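The arithmetic in the bidding example can be checked in a few lines (the numbers are the illustrative ones from the text above, not real campaign data):

```python
p_raw, p_true = 0.40, 0.05       # model output vs. true click probability
inflation = p_raw / p_true       # bids are 8x too high per impression
budget = 100_000                 # monthly spend, in dollars

# Spending the full budget at 8x-inflated bids buys inventory worth
# only budget / inflation at fair prices; the rest is overpayment.
fair_value = budget / inflation  # $12,500 of fairly priced inventory
wasted = budget - fair_value     # $87,500 overspent
print(inflation, wasted)         # 8.0 87500.0
```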
📐 Mathematical Foundations
✅ Act 1 — Why Oversample (The Fix)
With τ = 5% natural data, gradient descent sees 1 positive per 19 negatives each iteration. The intercept b₀ gets pushed deeply negative, p̂ ≈ 0 for nearly everything, and the gradient for the slope b₁ vanishes. The model never learns who is risky — just that almost no one is.
Keeping all positives and subsampling negatives to 50/50 creates a balanced gradient in every training step. Now b₁ gets a strong, consistent learning signal — AUC rises substantially.
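The subsampling step described above is mechanically simple: keep every positive, then draw just enough negatives to hit the target positive rate. A minimal sketch (function name and the single-feature setup are assumptions for illustration):

```python
import numpy as np

def subsample_negatives(X, y, target_pos_rate, rng):
    """Keep all positives; randomly keep just enough negatives so that
    positives make up target_pos_rate of the training set."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_neg_keep = int(round(len(pos_idx) * (1 - target_pos_rate) / target_pos_rate))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, n_neg_keep, replace=False)])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
y = np.r_[np.ones(500), np.zeros(9500)]   # 5% positives, as in the lab
X = rng.normal(size=len(y))               # one continuous predictor
Xb, yb = subsample_negatives(X, y, 0.50, rng)
print(yb.mean(), len(yb))                 # 0.5 1000
```

Note the cost built into this design: 9,000 of the 9,500 negatives are discarded at a 50/50 target, which is exactly why the intercept no longer reflects the true base rate.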
⚠️ Act 2 — The Calibration Cost (The Problem)
The model's intercept b₀ now calibrates to a 50% base rate — because that's the world it trained in. On the real holdout (5% positives), raw predictions are 5–15× too high.
Better discrimination. Broken probability scale.
King-Zeng Prior Correction (2001):
$$\hat{p}_{\text{corrected}} = \frac{\hat{p}_{\text{raw}}}{\hat{p}_{\text{raw}} + (1 - \hat{p}_{\text{raw}}) \cdot \dfrac{s \,(1-\tau)}{(1-s)\,\tau}}$$

where s = positive rate in the training set and τ = true population base rate. When s = τ, this reduces to p̂_corrected = p̂_raw.
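The formula translates into a one-line (plus bookkeeping) numpy function. A sketch, with the function name chosen here for illustration:

```python
import numpy as np

def king_zeng(p_raw, s, tau):
    """Rescale probabilities from a model trained at positive rate s
    back to the true population base rate tau (King & Zeng, 2001)."""
    p_raw = np.asarray(p_raw, float)
    odds_factor = (s * (1 - tau)) / ((1 - s) * tau)
    return p_raw / (p_raw + (1 - p_raw) * odds_factor)

# A model trained at 50/50 that outputs 0.50 maps back to the 5% base rate.
print(king_zeng(0.50, s=0.50, tau=0.05))   # ~0.05
# When s == tau the correction is a no-op.
print(king_zeng(0.30, s=0.05, tau=0.05))   # ~0.30
```

Because the correction is a strictly increasing function of p̂_raw, it never changes which customer outranks which: AUC is untouched, exactly as the step-by-step instructions describe.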
Reliability Diagram (Calibration Curve):
Bin all holdout predictions into deciles by predicted probability. For each bin, plot the mean predicted probability (x-axis) against the observed fraction of positives (y-axis). Points on the diagonal = perfect calibration. Points above the diagonal = probabilities are inflated (model is overconfident).
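The decile-binning procedure above can be sketched as follows (function name is illustrative; the synthetic data is perfectly calibrated by construction, so the points should hug the diagonal):

```python
import numpy as np

def reliability_bins(p_pred, y_true, n_bins=10):
    """Decile-bin predictions; return (mean predicted, observed rate) per bin."""
    p_pred, y_true = np.asarray(p_pred, float), np.asarray(y_true, int)
    edges = np.quantile(p_pred, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, p_pred, side="right") - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            xs.append(p_pred[mask].mean())   # x-axis: mean predicted probability
            ys.append(y_true[mask].mean())   # y-axis: observed fraction positive
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(1)
p = rng.uniform(0, 1, 20000)
y = rng.uniform(0, 1, 20000) < p             # calibrated by construction
xs, ys = reliability_bins(p, y)
print(np.abs(xs - ys).max())                 # small: points sit near the diagonal
```

Feeding in inflated predictions instead would push every (x, y) point above the diagonal, which is the visual signature of oversampling in the mini-charts.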
| Metric | What It Measures | Effect of Oversampling |
|---|---|---|
| Sensitivity @50% | Fraction of churners caught at a 50% decision threshold | ↑ Improves dramatically — 0% → 60%+ (the primary operational win) |
| AUC-ROC | Rank ordering quality (discrimination) | ↑ Improves — the natural model's AUC sits near 0.5; balanced gradients let the slope learn (see Act 1) |
| Brier Score | Mean squared probability error (lower = better) | ↑ Worsens — probabilities inflated |
| ECE | Expected Calibration Error (lower = better) | ↑ Worsens significantly |
| Mean Predicted Prob. | Average output score on holdout (should equal τ) | ↑ Inflated — exceeds τ by roughly the oversampling multiple |
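The two calibration metrics in the table are short computations. A sketch with equal-width bins for ECE (the binning scheme and function names are assumptions; implementations vary):

```python
import numpy as np

def brier(p, y):
    """Mean squared error between predicted probability and 0/1 outcome."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def ece(p, y, n_bins=10):
    """Expected Calibration Error: bin-size-weighted |mean pred - observed rate|."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)  # equal-width bins
    total = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():
            total += m.mean() * abs(p[m].mean() - y[m].mean())
    return total

# Inflated probabilities (as after uncorrected oversampling) worsen both metrics
# even when the rank order is identical.
y = np.r_[np.ones(50), np.zeros(950)]        # 5% base rate
honest = np.full(1000, 0.05)
inflated = np.full(1000, 0.40)
print(brier(honest, y), brier(inflated, y))  # honest is lower (better)
print(ece(honest, y), ece(inflated, y))      # honest ~0, inflated ~0.35
```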
⚠️ Verification Check: After applying the correction, the mean predicted probability on any representative holdout should approximately equal τ. If it doesn't, check whether your training positive rate s and true base rate τ are specified correctly.
SIMULATION SETTINGS
KEY INSIGHTS
🧠 The Analyst's Playbook for Rare-Event Models
🎯 Discrimination vs. Calibration — Two Different Jobs
Discrimination (AUC): Can the model rank customers correctly? Does the high-risk customer score higher than the low-risk one? You need this for prioritization, targeting, and triage.
Calibration: Do predicted probabilities mean what they say? If the model says 10%, do roughly 10% of those customers actually convert? You need this for any dollar-value calculation.
The workflow: Oversample to get discrimination. Apply the correction to restore calibration. You don't have to choose between them.
📊 When Does Calibration Actually Matter?
| Use Case | Need AUC? | Need Calibration? | Reason |
|---|---|---|---|
| Direct mail to top 10% of model | ✅ Yes | ❌ No | Only rank matters — who's in the top decile |
| Programmatic bid pricing | ✅ Yes | ✅ Yes | Bid = p × value; bad p → bad bid |
| CLV estimation | — | ✅ Yes | CLV formulas directly multiply probability |
| Churn score thresholding | ✅ Yes | ⚠️ Sometimes | Depends on whether threshold is absolute or relative |
| A/B test lift measurement | — | ✅ Yes | Comparing predicted lift requires calibrated probability differences |
🔧 Calibration Correction Methods Compared
| Method | Complexity | Best For |
|---|---|---|
| King-Zeng formula | ⭐ One formula | Logistic regression with known oversampling rate |
| Platt scaling | ⭐⭐ Fit a 2nd logistic model | SVM, neural networks, any model with known holdout |
| Isotonic regression | ⭐⭐⭐ Non-parametric | Large holdout set, any model, no shape assumption |
| Temperature scaling | ⭐⭐ Single parameter | Neural networks, quick recalibration post-training |
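To make the last row of the table concrete: temperature scaling fits a single scalar T and replaces sigmoid(z) with sigmoid(z / T). A minimal sketch using a grid search over T to minimize holdout log loss (the grid-search approach and all names here are illustrative; production code typically uses a proper optimizer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_temperature(logits, y, grid=np.linspace(0.25, 10.0, 400)):
    """Pick the single temperature T minimizing holdout log loss of sigmoid(z/T)."""
    logits, y = np.asarray(logits, float), np.asarray(y, float)
    def nll(T):
        p = np.clip(sigmoid(logits / T), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return min(grid, key=nll)

# Overconfident model: true probabilities follow sigmoid(z / 3),
# but the model emits raw logits z. Fitting should recover T near 3.
rng = np.random.default_rng(2)
z = rng.normal(0, 4, 20000)
y = rng.uniform(size=20000) < sigmoid(z / 3.0)
T = fit_temperature(z, y)
print(T)  # close to 3
```

Like King-Zeng, dividing logits by a positive T is monotone, so temperature scaling also leaves AUC unchanged while repairing the probability scale.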
⚠️ What This Simulation Doesn't Show
This tool uses negative subsampling: all positive (rare) examples are kept; the majority class is subsampled to hit the target ratio. In practice, SMOTE generates synthetic interpolated positives instead of subsampling, which avoids discarding negatives. The King-Zeng calibration cost applies to both approaches equally — anything that shifts the training class balance shifts the intercept.
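For readers unfamiliar with SMOTE, the core idea mentioned above is just interpolation between minority-class neighbors. A minimal sketch of that idea (not the full SMOTE algorithm; function name and parameters are illustrative):

```python
import numpy as np

def smote_sketch(X_min, n_new, k, rng):
    """Generate synthetic positives by interpolating between each sampled
    minority point and one of its k nearest minority-class neighbors."""
    X_min = np.asarray(X_min, float)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest per minority point
    base = rng.integers(0, len(X_min), n_new)
    nbr = neighbors[base, rng.integers(0, k, n_new)]
    lam = rng.uniform(0, 1, (n_new, 1))         # random interpolation weight
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(3)
X_min = rng.normal(size=(50, 2))                # 50 minority points, 2 features
X_new = smote_sketch(X_min, n_new=200, k=5, rng=rng)
print(X_new.shape)  # (200, 2)
```

Whether positives are duplicated, synthesized, or negatives are discarded, the training positive rate s moves away from τ, so the same intercept shift (and the same King-Zeng fix) applies.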
The simulation also uses a single continuous predictor with logistic regression. With tree-based models (XGBoost, Random Forest), raw probability outputs are inherently miscalibrated even without oversampling — Platt scaling or isotonic regression are the standard fix there.