Propensity Score Matching

Causal Inference

Estimate causal treatment effects from observational marketing data using propensity score matching. Upload data with a treatment indicator and outcome, select covariates for balance, and the tool performs nearest-neighbor matching to create comparable treated and control groups.

1. Upload Data
2. Configure Variables
3. Run Matching
4. Interpret Results

👨‍🏫 Professor Mode: Guided Learning Experience

New to propensity score matching? Enable Professor Mode for step-by-step guidance through matching methods, balance diagnostics, and interpreting causal treatment effects!

OVERVIEW & APPROACH

Here is the mistake that gets made in boardrooms every week:

A company launches a loyalty program. Enrolled members spend $1,840 per year on average. Non-enrolled customers spend $1,210. The VP of Marketing opens a slide deck and announces: "The loyalty program drives $630 in incremental annual spend per customer."

This analysis is almost certainly wrong — and the error is not arithmetic. It is causal.

Why the Simple Comparison Lies

Loyalty programs don't recruit customers randomly. Customers who choose to enroll tend to already be the company's best, most engaged buyers — higher prior spend, more frequent visits, longer tenure. They were going to spend more regardless of the program. When you compare enrolled vs. non-enrolled customers, you are not comparing the effect of the program. You are comparing two different kinds of people.

This is the selection bias problem. The treatment (enrollment) was not randomly assigned — customers self-selected in. And the very characteristics that made them enroll are the same characteristics that drive higher spending. So the naive gap of $630 is a mixture of two things that are impossible to disentangle without a more careful approach:

  • Real program effect — the causal lift the program actually created; this is what you want to measure.
  • Selection artifact — pre-existing advantages of the type of customer who enrolled; this is noise you need to remove.

A randomized experiment would solve this cleanly by assigning enrollment randomly. But in most real marketing settings — CRM programs, ad exposure, promotional offers, app features — you simply cannot run a randomized experiment after the fact. You have observational data about what happened in the world, and you need to extract a credible causal answer from it.

The Core Idea: Compare Like With Like

Propensity score matching does not try to compare all treated customers to all untreated customers. Instead, it asks: for each customer who enrolled, can we find a customer who did NOT enroll but who looked nearly identical before the enrollment decision — same spending history, same visit frequency, same demographics?

If we can construct those pairs, we are no longer comparing different types of people. We are comparing the same type of person under two conditions. Any remaining outcome gap is credibly attributable to the program itself.

  • Enrolled customers self-select — they look systematically different from non-enrolled.
  • Propensity scores compress all covariates into one matchable number.
  • Each treated customer finds its nearest statistical twin.
  • Compare only within pairs — the gap is the causal effect, not selection bias.

How Propensity Score Matching Works

The challenge with "find a similar customer" is that people have many characteristics simultaneously. Matching on ten variables at once quickly becomes impossible — there may be no pair that agrees on everything. The propensity score collapses all that complexity into a single, matchable number.

Step 1 — Estimate the propensity score: $$ e(X_i) = \Pr(T_i = 1 \mid X_i) = \text{logit}^{-1}(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}) $$ Fit a logistic regression that predicts who enrolled from their observable characteristics \(X\). Each customer's fitted probability — their propensity score \(\hat{e}(X_i)\) — summarizes "how likely someone with this profile was to enroll." Two customers with the same score are, in expectation, balanced on all the characteristics the model saw, even if their individual values differ.
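As a concrete sketch of Step 1, here is a minimal pure-Python logistic regression fit by gradient ascent on simulated enrollment data. Everything here is an illustrative assumption (the covariates, sample size, and coefficients are made up for the demo), not the tool's actual model:

```python
import math
import random

random.seed(0)

# Hypothetical covariates: prior spend and visit frequency both raise
# the (simulated) chance of enrolling. All numbers are illustrative.
n = 400
spend = [random.gauss(10, 3) for _ in range(n)]
visits = [random.gauss(5, 2) for _ in range(n)]
T = []
for s, v in zip(spend, visits):
    p_true = 1 / (1 + math.exp(-(-4 + 0.2 * s + 0.3 * v)))
    T.append(1 if random.random() < p_true else 0)

def standardize(col):
    """Center and scale a column so gradient ascent is well-conditioned."""
    m = sum(col) / len(col)
    sd = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - m) / sd for v in col]

x1, x2 = standardize(spend), standardize(visits)

def fit_logistic(x1, x2, T, lr=0.5, epochs=500):
    """Fit logit P(T=1) = b0 + b1*x1 + b2*x2 by batch gradient ascent."""
    b = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        g = [0.0, 0.0, 0.0]
        for a, c, t in zip(x1, x2, T):
            p = 1 / (1 + math.exp(-(b[0] + b[1] * a + b[2] * c)))
            err = t - p  # gradient of the log-likelihood
            g[0] += err
            g[1] += err * a
            g[2] += err * c
        for j in range(3):
            b[j] += lr * g[j] / len(T)
    return b

b = fit_logistic(x1, x2, T)
# Each fitted probability is that customer's propensity score.
pscores = [1 / (1 + math.exp(-(b[0] + b[1] * a + b[2] * c)))
           for a, c in zip(x1, x2)]
```

Because enrollment really was driven by spend and visits in this simulation, the fitted scores separate the groups: treated customers average a higher propensity score than controls, which is exactly the pattern the diagnostics below look for.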

Step 2 — Match on the score: For each customer who enrolled (treated), find the non-enrolled customer (control) with the closest propensity score. This creates matched pairs where both people looked equally likely to enroll — making treatment assignment as good as random within each pair.

Step 3 — Estimate the treatment effect: $$ \text{ATT} = \frac{1}{n_1} \sum_{i:\, T_i=1} \bigl[ Y_i - Y_{j(i)} \bigr] $$ The Average Treatment Effect on the Treated (ATT) is the mean outcome difference within matched pairs. Because the pairs are comparable, this gap is the causal program effect — not the selection artifact. If the loyalty program really does change behavior, it shows up here. If the naive $630 gap shrinks dramatically after matching, that is evidence that most of it was pre-existing customer quality, not the program working.
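Steps 2 and 3 can be sketched together: greedy 1:1 nearest-neighbor matching on the propensity score (without replacement, with a caliper), then the ATT as the mean within-pair outcome difference. The scores, outcomes, and the built-in +100 "true lift" are simulated assumptions for illustration:

```python
import random

random.seed(1)

# Synthetic units as (propensity score, outcome). Treated scores skew
# higher, and a true lift of +100 is baked into treated outcomes.
treated = [random.uniform(0.3, 0.9) for _ in range(50)]
control = [random.uniform(0.1, 0.7) for _ in range(150)]
treated = [(e, 1000 * e + 100 + random.gauss(0, 20)) for e in treated]
control = [(e, 1000 * e + random.gauss(0, 20)) for e in control]

def match_att(treated, control, caliper=0.05):
    """Greedy nearest-neighbor matching; returns (ATT, matched pair count)."""
    available = list(range(len(control)))
    diffs = []
    # Match the hardest-to-match (highest-score) treated units first.
    for e_t, y_t in sorted(treated, reverse=True):
        if not available:
            break
        j = min(available, key=lambda k: abs(control[k][0] - e_t))
        if abs(control[j][0] - e_t) <= caliper:
            available.remove(j)  # matching without replacement
            diffs.append(y_t - control[j][1])
    att = sum(diffs) / len(diffs)
    return att, len(diffs)

att, n_pairs = match_att(treated, control)
```

Note that some high-score treated units find no control within the caliper and go unmatched — the common-support behavior described in the Matching Summary section. The recovered ATT lands near the true +100 lift, while a naive group-mean comparison on these data would be inflated by the score gap between groups.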

Where PSM is most valuable in marketing practice

PSM is the right tool whenever treatment was not randomly assigned and you need a causal answer rather than a correlation. Common marketing applications include:

  • CRM & loyalty programs — did enrollment cause higher spend, or did high-spenders self-select in?
  • Email & push campaigns — did the campaign cause conversions, or were recipients already highly engaged?
  • Ad exposure — did the ad lift brand awareness, or were exposed users already category enthusiasts?
  • App feature adoption — does using Feature X cause retention, or do retained users just happen to discover it?
  • Sales interventions — did the rep's call cause the deal to close, or were those accounts already warm?

What you need: (1) a binary treatment indicator (enrolled vs. not, received vs. not), (2) an outcome to measure, and (3) pre-treatment covariates that predict who self-selected into treatment. The richer your covariate set, the more selection bias you can remove.

Key limitation: PSM only removes bias from observed characteristics. If customers who enrolled also had unmeasured motivation (e.g., stronger brand affinity you never surveyed), that hidden factor still confounds your estimate. PSM makes the comparison as clean as your data allows — but it cannot manufacture a randomized experiment from observational records. This is why domain knowledge about why people self-select matters as much as the method itself.

ATT vs. ATE: which effect are you actually estimating?

PSM with 1:1 matching estimates the Average Treatment Effect on the Treated (ATT): among the customers who actually enrolled, what was the average effect compared to their matched non-enrollees? This is the most actionable causal question for most marketing decisions, because it reflects the program's impact on the people it actually reached.

The Average Treatment Effect (ATE) asks a different question: what would the effect be on a randomly selected person from the full population, whether or not they would normally enroll? ATE is relevant when you are considering rolling the program out to everyone, including people who would never have enrolled voluntarily. PSM typically estimates ATT, not ATE — keep this distinction in mind when generalizing results.

🚧
Reality Check: Why Marketers Sometimes Resist This Method

Here is something most textbooks don’t tell you: propensity score matching will almost always produce a smaller effect estimate than the naive comparison — and that is politically inconvenient.

The mechanism is direct. The naive estimate is inflated by selection bias: your best, most engaged customers self-selected into the program, so they were going to perform better anyway. PSM strips that inflation away. What remains is the causal effect of the program itself — which is smaller, more honest, and harder to put in an exec deck.

In practice, this dynamic surfaces in recognizable ways:

  • “We already know the program works.” — Resistance to running the analysis at all.
  • Buried results. — The analysis gets done; the smaller number gets quietly omitted from the presentation.
  • Methodological negotiation. — Pressure to try a wider caliper, drop covariates, or “try a different model” until the number looks better.
  • Competing decks. — The naive number keeps circulating in marketing slides while the PSM estimate lives in a footnote.

This is not a reason to avoid PSM — it is a reason to understand what you are up against when you use it. A correctly smaller number is more valuable than an impressive wrong one. The analyst who can explain why the PSM estimate is smaller, and defend it under pressure, is doing something genuinely difficult and genuinely useful.

DATA SOURCE

📚

Use a Case Study

Use presets to explore causal inference scenarios, such as loyalty program enrollment or email campaign exposure. Each scenario provides a raw data file with treatment, outcome, and covariates.

📤

Upload Your Data

Upload a CSV file with case-level data including: a binary treatment indicator (0/1), an outcome variable, and covariates for matching. Headers are required.

Drag & Drop raw data file (.csv, .tsv, .txt, .xls, .xlsx)

Include headers; must have treatment (binary), outcome, and covariate columns.

No file uploaded.

DATA & VARIABLE SELECTION

PROPENSITY SCORE DIAGNOSTICS

Propensity Score Distribution

Interpretation Aid

Technical Interpretation

What you're seeing: The histograms show the distribution of estimated propensity scores for treated (blue) and control (orange) groups. Each propensity score represents the predicted probability that a unit received treatment, based on its covariate values.

Good overlap: When both distributions cover similar ranges (e.g., both spanning 0.2–0.8), matching can find comparable units. The key assumption of PSM—that we can find similar treated and control units—is satisfied.

Warning signs: If treated units cluster near 1.0 and controls near 0.0 with little overlap, the propensity model is "perfectly predicting" treatment. This means groups are fundamentally different on observed characteristics, and matching will fail or produce few matches.
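A quick common-support check can make this concrete: what share of treated propensity scores fall inside the range spanned by the control group? The score lists below are hypothetical numbers for illustration:

```python
# Hypothetical propensity scores for a small treated and control group.
treated_ps = [0.35, 0.48, 0.52, 0.61, 0.68, 0.88, 0.93]
control_ps = [0.12, 0.20, 0.31, 0.44, 0.55, 0.63, 0.71]

# Common support: treated scores inside the controls' [min, max] range.
lo, hi = min(control_ps), max(control_ps)
on_support = [p for p in treated_ps if lo <= p <= hi]
support_rate = len(on_support) / len(treated_ps)
# The two treated scores above 0.71 (0.88 and 0.93) have no comparable
# controls and would likely go unmatched.
```

Here 5 of 7 treated units are on support; the remaining two are the "treated units cluster near 1.0" problem in miniature.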

Practical Interpretation

Marketing example: In a loyalty program analysis, if customers who enrolled all have very high propensity scores while non-enrollees have very low scores, it means enrolled customers were systematically different (e.g., higher prior spend, more visits). Finding a "fair comparison" will be difficult.

What to do if overlap is poor: (1) Use fewer or different covariates, (2) collect more data, or (3) consider that the treatment effect may not be estimable with this data—the groups are too different to compare.

Covariate Balance: Before vs. After Matching

Interpretation Aid

Technical Interpretation

Standardized Mean Difference (SMD): Each point shows how different treated and control groups are on a covariate, measured in pooled standard deviation units. An SMD of 0.2 means the groups differ by 0.2 standard deviations on that variable.

Balance thresholds: SMD < 0.1 indicates excellent balance (green zone). SMD 0.1–0.25 is acceptable but shows residual imbalance (yellow). SMD > 0.25 suggests meaningful differences that may bias the ATT estimate (red).

Before vs. After: Open circles show pre-matching imbalance; filled circles show post-matching balance. Successful matching moves all points toward zero (the center dashed line).

Practical Interpretation

What it tells you: This "Love Plot" shows whether matching created comparable groups. If all filled circles (after matching) are within the ±0.1 bands, you can be confident that treated and matched control groups are similar on observed characteristics.

Marketing example: If "Prior_Spend" has SMD = 0.5 before matching (treated spent more) but SMD = 0.05 after matching, the matched comparison controls for prior spending behavior. The ATT now compares customers with similar spending histories.

Action: If any covariate remains imbalanced (SMD > 0.25) after matching, consider adding it to the outcome model or interpreting results cautiously—remaining confounding may bias the effect estimate.

MATCHING SUMMARY

Treated Units --
Control Units --
Matched Pairs --
Unmatched Treated --
Mean PS (Treated) --
Mean PS (Matched Control) --
Understanding Matching Quality

Technical Interpretation

Matched pairs: The number of treated units successfully matched to control units within the caliper threshold. This determines the effective sample size for ATT estimation.

Unmatched treated: Treated units outside the "common support" region—they have propensity scores too extreme to find comparable controls. High unmatched rates (>20%) suggest limited overlap.

Mean propensity scores: After matching, mean PS should be nearly identical between treated and matched controls. Differences > 0.05 suggest residual imbalance.

Practical Interpretation

Sample size trade-off: PSM often discards observations (unmatched treated or unused controls). A match rate of 80%+ is typically good; below 50% suggests the groups may be too different for meaningful comparison.

Who gets excluded: Unmatched treated units are often "extreme" cases—e.g., customers with very high engagement who enrolled in a program but have no comparable non-enrollees. The ATT estimate applies only to the matched subset, which may differ from all treated units.

Generalizability: If many treated units are excluded, ask: "Can I generalize the treatment effect to the excluded cases?" If they're systematically different, the answer may be no.

COVARIATE BALANCE TABLE

Standardized Mean Differences

Covariate | Mean (Treated) | Mean (Control) | SMD Before | Mean (Matched Ctrl) | SMD After | % Reduction
Run matching to see balance diagnostics.

SMD interpretation: |SMD| < 0.1 = excellent balance (green), 0.1–0.25 = acceptable (yellow), > 0.25 = concerning (red).

About Standardized Mean Differences

Technical Interpretation

What SMD measures: The difference in means between treated and control groups, divided by the pooled standard deviation: SMD = (Mean_Treated - Mean_Control) / SD_pooled. This makes the measure unit-free and comparable across variables.

For categorical variables: Each category level is treated as a 0/1 indicator, and SMD is calculated for the proportion in each level.

% Reduction: Shows how much matching improved balance. 80%+ reduction is excellent; negative reduction means matching made things worse (rare, but possible with replacement).
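The SMD formula and the % Reduction column can be sketched directly, using illustrative "Prior_Spend" values (the numbers are hypothetical, not from any real dataset):

```python
def smd(treated, control):
    """(mean_T - mean_C) / pooled SD, using the simple pooled-variance form."""
    mt = sum(treated) / len(treated)
    mc = sum(control) / len(control)
    vt = sum((v - mt) ** 2 for v in treated) / (len(treated) - 1)
    vc = sum((v - mc) ** 2 for v in control) / (len(control) - 1)
    pooled_sd = ((vt + vc) / 2) ** 0.5
    return (mt - mc) / pooled_sd

# Hypothetical Prior_Spend values: treated spend far more than the full
# control pool, but the matched controls look almost identical.
treated_spend = [520, 610, 580, 640, 700, 560]
all_controls  = [300, 420, 380, 460, 350, 410, 500, 330]
matched_ctrls = [515, 605, 578, 638, 698, 558]

before = smd(treated_spend, all_controls)
after = smd(treated_spend, matched_ctrls)
pct_reduction = 100 * (1 - abs(after) / abs(before))
```

In this toy example the pre-matching SMD is far into the "concerning" red zone, the post-matching SMD is under the 0.1 "excellent" threshold, and the % Reduction exceeds 80% — the pattern a successful match should produce.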

Practical Interpretation

Reading the table: Compare the "SMD Before" and "SMD After" columns. Ideally, all "SMD After" values should be near zero (green). Covariates with high SMD after matching may still confound your treatment effect estimate.

Example: If "Prior_Spend" has SMD = 0.4 before matching, treated customers spent 0.4 SDs more than controls before the program. After matching (SMD = 0.05), the difference is negligible—you're now comparing customers with similar spending histories.

If balance is poor: Try adjusting matching settings (wider caliper, different covariates) or consider that selection bias may be too strong to overcome with this data.

TREATMENT EFFECT RESULTS

Average Treatment Effect on the Treated (ATT)

Estimated ATT: --
Standard Error: --
95% CI: --
t-statistic: --
p-value: --
Mean Outcome (Treated): --
Mean Outcome (Matched Control): --
Interpreting the ATT

Technical Interpretation

Average Treatment Effect on the Treated (ATT): The mean difference in outcomes between treated units and their matched controls: ATT = (1/n₁) Σ(Y_treated - Y_matched_control), where n₁ is the number of matched treated units. This estimates the causal effect of treatment for those who received it.

Standard Error & 95% CI: The SE measures uncertainty in the ATT estimate. The 95% CI is the range of effect sizes consistent with the data: across repeated samples, 95% of intervals constructed this way would contain the true effect. If the CI excludes zero, the effect is statistically significant at α = 0.05.

t-statistic & p-value: The t-stat is ATT/SE; p-value tests H₀: ATT = 0. p < 0.05 provides evidence of a non-zero treatment effect.
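The inference panel above reduces to a few lines of arithmetic on the within-pair differences. This sketch uses hypothetical pair differences, and substitutes the normal critical value 1.96 for the exact t quantile:

```python
import math

# Hypothetical within-pair differences: Y_treated - Y_matched_control.
pair_diffs = [140, 95, 180, 60, 130, 110, 85, 155, 100, 145]

n1 = len(pair_diffs)
att = sum(pair_diffs) / n1                                  # mean pair difference
sd = math.sqrt(sum((d - att) ** 2 for d in pair_diffs) / (n1 - 1))
se = sd / math.sqrt(n1)                                     # SE of the mean
ci = (att - 1.96 * se, att + 1.96 * se)                     # approximate 95% CI
t_stat = att / se                                           # tests H0: ATT = 0
```

With these numbers the ATT is 120, the CI sits entirely above zero, and the t-statistic is well beyond the conventional significance threshold — the "CI excludes zero" case described above.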

Practical Interpretation

What the number means: If ATT = 120 for a spending outcome, treated customers spent an average of $120 more than similar untreated customers. This is the estimated causal impact of the treatment.

Business decision-making: Compare the ATT to the cost of treatment. If a loyalty program costs $50/customer to run but generates ATT = $120 in additional spend, the program has positive ROI.

Caution: The ATT is valid only if (1) the propensity model includes all confounders, (2) matching achieved good balance, and (3) there are no unmeasured confounders. PSM cannot prove causation—it can only adjust for observed differences.

Effect size vs. significance: A large p-value alongside a meaningful ATT may simply reflect insufficient sample size, not the absence of an effect. A tiny p-value with a tiny ATT may be statistically significant but practically irrelevant. Always consider both.

Statistical Interpretation

Run matching analysis to see results.

Managerial Takeaway

Run matching analysis to see actionable insights.

The downloaded file includes the matched sample with propensity scores, match IDs, and outcomes.

METHODS COMPARISON: WHAT DID MATCHING BUY US?

Propensity score matching is most compelling when compared against simpler alternatives. The chart below shows three estimates of the treatment effect from the same data — revealing how much selection bias the naive comparison contains, and whether parametric OLS adjustment or PSM better handles it.

Understanding the Method Comparison

Why Three Methods?

Each approach makes different assumptions about selection bias and uses the data differently. Comparing them exposes how much of the raw effect was real vs. pre-existing group differences.

Naive Mean Difference: Simply compares treated vs. control group averages with no adjustment. Almost always the largest estimate — it conflates the treatment effect with pre-existing differences between who self-selected into treatment.

OLS with Covariates: Fits a linear regression with treatment + covariates as predictors. Adjusts for confounders parametrically (assumes linearity) and uses the full sample — even control units far outside the treated group's range. A sensible baseline but can extrapolate in ways that are hard to audit.

PSM ATT: Restricts the comparison to matched pairs only, enforcing the common support assumption. More transparent about which comparisons are being made. It explicitly discards unmatched units rather than extrapolating over them — at the cost of a smaller effective sample.
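A small simulation makes the naive-vs-adjusted gap tangible. This sketch compares only the naive mean difference against OLS adjustment (the PSM leg is omitted for brevity); the data-generating process, with a single confounder and a true +100 lift, is an assumption built for the demo:

```python
import math
import random

random.seed(2)

# One confounder x drives both enrollment and spend; the true lift is +100.
n = 1000
rows = []
for _ in range(n):
    x = random.gauss(0, 1)
    t = 1 if random.random() < 1 / (1 + math.exp(-1.5 * x)) else 0
    y = 500 + 200 * x + 100 * t + random.gauss(0, 30)
    rows.append((x, t, y))

# Naive estimate: raw gap in group means (conflates lift with selection).
yt = [y for x, t, y in rows if t == 1]
yc = [y for x, t, y in rows if t == 0]
naive = sum(yt) / len(yt) - sum(yc) / len(yc)

def ols_treatment_coef(rows):
    """Coefficient on t from y ~ 1 + x + t, via the 3x3 normal equations."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0, 0.0, 0.0]
    for x, t, y in rows:
        f = [1.0, x, float(t)]
        for i in range(3):
            b[i] += f[i] * y
            for j in range(3):
                A[i][j] += f[i] * f[j]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, 3):
            m = A[r][c] / A[c][c]
            for j in range(c, 3):
                A[r][j] -= m * A[c][j]
            b[r] -= m * b[c]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef[2]

ols = ols_treatment_coef(rows)
```

The naive gap lands far above the true +100 because enrollees had high x to begin with, while the covariate-adjusted estimate recovers roughly the true lift. This is the forest-plot pattern the section describes: the leftmost (adjusted) diamonds cluster near the causal effect and the naive diamond sits well to the right.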

Reading the Forest Plot

Each row shows one method's point estimate (diamond) with 95% confidence interval (horizontal whiskers). The dashed vertical line at zero is the null (no effect). A CI that crosses zero means that method's estimate is not statistically significant at α = 0.05.

Interpreting the Pattern

If all three are similar: Selection bias was modest — the raw and adjusted estimates agree. Treatment effect is robust to method choice.

If PSM is much lower than naive: Selection bias was large. Treated units had favorable characteristics before treatment; the raw gap overstated the causal effect.

If OLS ≈ PSM but both < naive: The confounders you included capture most of the selection mechanism; both adjustment approaches agree.

If OLS and PSM diverge substantially: The linear OLS model may be misspecified, or treatment effects are heterogeneous across the propensity score distribution. PSM, by restricting to common support, is generally more defensible in this case.

PROPENSITY MODEL DETAILS

View Propensity Score Model Coefficients

The propensity score model predicts treatment assignment from covariates using logistic regression.

Run matching to see the propensity model.
Covariate | Coefficient | Std. Error | z | p-value | Odds Ratio
Run matching to see coefficients.
Pseudo R²: -- Model χ²: -- p-value: --