Estimate causal treatment effects from observational marketing data using propensity score matching. Upload data with a treatment indicator and outcome, select covariates for balance, and the tool performs nearest-neighbor matching to create comparable treated and control groups.
1. Upload Data
2. Configure Variables
3. Run Matching
4. Interpret Results
👨‍🏫 Professor Mode: Guided Learning Experience
New to propensity score matching? Enable Professor Mode for step-by-step guidance through matching methods, balance diagnostics, and interpreting causal treatment effects!
OVERVIEW & APPROACH
⚠
Here is the mistake that gets made in boardrooms every week:
A company launches a loyalty program. Enrolled members spend $1,840 per year on average.
Non-enrolled customers spend $1,210. The VP of Marketing opens a slide deck and announces:
"The loyalty program drives $630 in incremental annual spend per customer."
This analysis is almost certainly wrong — and the error is not arithmetic. It is causal.
Why the Simple Comparison Lies
Loyalty programs don't recruit customers randomly. Customers who choose to enroll tend to already be
the company's best, most engaged buyers — higher prior spend, more frequent visits, longer tenure.
They were going to spend more regardless of the program. When you compare enrolled vs. non-enrolled
customers, you are not comparing the effect of the program. You are comparing
two different kinds of people.
This is the selection bias problem. The treatment (enrollment) was not randomly assigned —
customers self-selected in. And the very characteristics that made them enroll are the same characteristics
that drive higher spending. So the naive gap of $630 is a mixture of two things that are impossible to
disentangle without a more careful approach:
Real program effect
The causal lift the program actually created — what you want to measure
Selection artifact
Pre-existing advantages of the type of customer who enrolled — noise you need to remove
A randomized experiment would solve this cleanly by assigning enrollment randomly. But in most real marketing
settings — CRM programs, ad exposure, promotional offers, app features — you simply cannot run a
randomized experiment after the fact. You have observational data about what happened in the world, and you
need to extract a credible causal answer from it.
The Core Idea: Compare Like With Like
Propensity score matching does not try to compare all treated customers to all untreated customers.
Instead, it asks: for each customer who enrolled, can we find a customer who did NOT enroll
but who looked nearly identical before the enrollment decision — same spending history,
same visit frequency, same demographics?
If we can construct those pairs, we are no longer comparing different types of people.
We are comparing the same type of person under two conditions. Any remaining outcome gap
is credibly attributable to the program itself.
In short:
1. Enrolled customers self-select — they look systematically different from non-enrollees.
2. Propensity scores compress all covariates into one matchable number.
3. Each treated customer finds its nearest statistical twin among the non-enrolled.
4. Comparing only within matched pairs isolates the causal effect from selection bias.
How Propensity Score Matching Works
The challenge with "find a similar customer" is that people have many characteristics simultaneously.
Matching on ten variables at once quickly becomes impossible — there may be no pair that agrees
on everything. The propensity score collapses all that complexity into a single, matchable number.
Step 1 — Estimate the propensity score:
$$ e(X_i) = \Pr(T_i = 1 \mid X_i) = \text{logit}^{-1}(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_p X_{pi}) $$
Fit a logistic regression that predicts who enrolled from their observable characteristics \(X\).
Each customer's fitted probability — their propensity score \(\hat{e}(X_i)\) —
is a summary of "how likely someone with this profile was to enroll." Two customers with the same score
are statistically equivalent on all the characteristics the model saw, even if their individual values differ.
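Step 1 can be sketched in a few lines. This is a minimal simulation, not the tool's implementation: the variable names (`prior_spend`, `visits`, `enrolled`) and the coefficients are hypothetical, chosen only to illustrate fitting a logistic model and reading off each unit's propensity score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical pre-treatment covariates for illustration.
prior_spend = rng.normal(1200, 300, n)
visits = rng.poisson(8, n).astype(float)

# Self-selection: higher prior spend and more visits raise the chance
# of enrolling -- exactly the confounding PSM is meant to remove.
true_logit = -6 + 0.003 * prior_spend + 0.15 * visits
enrolled = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Step 1: fit the propensity model on standardized covariates (for
# numerical stability) and extract each unit's fitted probability,
# i.e. its propensity score e_hat(X_i).
X = np.column_stack([prior_spend, visits])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
ps = LogisticRegression().fit(X_std, enrolled).predict_proba(X_std)[:, 1]
```

Because the score is only used to rank and pair units, any monotone rescaling of the covariates (such as the standardization above) leaves the matching unchanged.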
Step 2 — Match on the score:
For each customer who enrolled (treated), find the non-enrolled customer (control) with the
closest propensity score. This creates matched pairs where both people looked equally likely
to enroll — making treatment assignment as good as random within each pair.
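Step 2 is commonly implemented as greedy 1:1 nearest-neighbor matching without replacement, with a caliper that rejects pairs whose scores are too far apart. The sketch below assumes `ps` is a vector of fitted propensity scores and `treated` a 0/1 indicator; it is one simple variant, not the only way to match.

```python
import numpy as np

def match_pairs(ps, treated, caliper=0.25):
    """Greedy 1:1 nearest-neighbor matching on the propensity score,
    without replacement. Returns (treated_index, control_index) pairs."""
    treated_idx = np.flatnonzero(treated == 1)
    control_idx = list(np.flatnonzero(treated == 0))
    pairs = []
    for t in treated_idx:
        if not control_idx:
            break  # ran out of controls
        dists = np.abs(ps[np.array(control_idx)] - ps[t])
        j = int(np.argmin(dists))
        # Accept the nearest control only if it is inside the caliper;
        # otherwise this treated unit stays unmatched.
        if dists[j] <= caliper:
            pairs.append((int(t), int(control_idx.pop(j))))
    return pairs
```

Matching with replacement (the option offered below) would skip the `pop`, letting one control serve as the twin for several treated units.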
Step 3 — Estimate the treatment effect:
$$ \text{ATT} = \frac{1}{n_1} \sum_{i:\, T_i=1} \bigl[ Y_i - Y_{j(i)} \bigr] $$
The Average Treatment Effect on the Treated (ATT) is the mean outcome difference
within matched pairs. Because the pairs are comparable, this gap is the causal program effect —
not the selection artifact. If the loyalty program really does change behavior, it shows up here.
If the naive $630 gap shrinks dramatically after matching, that is evidence that most of it was
pre-existing customer quality, not the program working.
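Once pairs exist, the ATT formula above is just the mean within-pair outcome difference. A minimal sketch, where `pairs` is a list of `(treated_index, control_index)` tuples from the matching step and `y` is the outcome vector (the toy numbers are illustrative, not real data):

```python
import numpy as np

def att(y, pairs):
    """ATT = mean of (Y_treated - Y_matched_control) across pairs."""
    diffs = np.array([y[t] - y[c] for t, c in pairs])
    return diffs.mean()

# Toy check: a constant $120 lift inside every pair recovers ATT = 120.
y = np.array([1840.0, 1720.0, 1500.0, 1380.0])
pairs = [(0, 1), (2, 3)]
# att(y, pairs) -> 120.0
```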
Where PSM is most valuable in marketing practice
PSM is the right tool whenever treatment was not randomly assigned and you need a causal answer rather than a correlation.
Common marketing applications include:
CRM & loyalty programs — did enrollment cause higher spend, or did high-spenders self-select in?
Email & push campaigns — did the campaign cause conversions, or were recipients already highly engaged?
Ad exposure — did the ad lift brand awareness, or were exposed users already category enthusiasts?
App feature adoption — does using Feature X cause retention, or do retained users just happen to discover it?
Sales interventions — did the rep's call cause the deal to close, or were those accounts already warm?
What you need: (1) a binary treatment indicator (enrolled vs. not, received vs. not), (2) an outcome to
measure, and (3) pre-treatment covariates that predict who self-selected into treatment.
The richer your covariate set, the more selection bias you can remove.
Key limitation: PSM only removes bias from observed characteristics. If customers who enrolled
also had unmeasured motivation (e.g., stronger brand affinity you never surveyed), that hidden factor
still confounds your estimate. PSM makes the comparison as clean as your data allows — but it
cannot manufacture a randomized experiment from observational records. This is why domain knowledge
about why people self-select matters as much as the method itself.
ATT vs. ATE: which effect are you actually estimating?
PSM with 1:1 matching estimates the Average Treatment Effect on the Treated (ATT):
among the customers who actually enrolled, what was the average effect compared to their matched
non-enrollees? This is the most actionable causal question for most marketing decisions,
because it reflects the program's impact on the people it actually reached.
The Average Treatment Effect (ATE) asks a different question: what would the effect be
on a randomly selected person from the full population, whether or not they would normally enroll?
ATE is relevant when you are considering rolling the program out to everyone, including people who
would never have enrolled voluntarily. PSM typically estimates ATT, not ATE — keep this
distinction in mind when generalizing results.
🚧
Reality Check: Why Marketers Sometimes Resist This Method
Here is something most textbooks don’t tell you: propensity score matching will
almost always produce a smaller effect estimate than the naive comparison — and
that is politically inconvenient.
The mechanism is direct. The naive estimate is inflated by selection bias: your best, most engaged
customers self-selected into the program, so they were going to perform better anyway. PSM strips
that inflation away. What remains is the causal effect of the program itself —
which is smaller, more honest, and harder to put in an exec deck.
In practice, this dynamic surfaces in recognizable ways:
“We already know the program works.” — Resistance to running the analysis at all.
Buried results. — The analysis gets done; the smaller number gets quietly omitted from the presentation.
Methodological negotiation. — Pressure to try a wider caliper, drop covariates, or “try a different model” until the number looks better.
Competing decks. — The naive number keeps circulating in marketing slides while the PSM estimate lives in a footnote.
This is not a reason to avoid PSM — it is a reason to understand what you are up against
when you use it. A correctly smaller number is more valuable than an impressive wrong one.
The analyst who can explain why the PSM estimate is smaller, and defend it under
pressure, is doing something genuinely difficult and genuinely useful.
DATA SOURCE
📚
Use a Case Study
Use presets to explore causal inference scenarios, such as loyalty program enrollment or email campaign exposure.
Each scenario provides a raw data file with treatment, outcome, and covariates.
📤
Upload Your Data
Upload a CSV file with case-level data including: a binary treatment indicator (0/1), an outcome variable, and covariates for matching. Headers are required.
Drag & Drop raw data file (.csv, .tsv, .txt, .xls, .xlsx)
Include headers; must have treatment (binary), outcome, and covariate columns.
No file uploaded.
DATA & VARIABLE SELECTION
Assign Variables for Matching
Select your treatment indicator, outcome variable, and covariates to balance on.
Can be binary (0/1) or continuous. The ATT will measure the treatment effect on this variable.
Select variables that may confound the treatment-outcome relationship.
Matching Options
Maximum PS difference for a valid match. 0.25 is recommended; increase if few matches are found.
Allow control units to be matched to multiple treated units.
PROPENSITY SCORE DIAGNOSTICS
Propensity Score Distribution
Interpretation Aid
Technical Interpretation
What you're seeing: The histograms show the distribution of estimated propensity scores for treated (blue) and control (orange) groups. Each propensity score represents the predicted probability that a unit received treatment, based on its covariate values.
Good overlap: When both distributions cover similar ranges (e.g., both spanning 0.2–0.8), matching can find comparable units. The key assumption of PSM—that we can find similar treated and control units—is satisfied.
Warning signs: If treated units cluster near 1.0 and controls near 0.0 with little overlap, the propensity model is "perfectly predicting" treatment. This means groups are fundamentally different on observed characteristics, and matching will fail or produce few matches.
Practical Interpretation
Marketing example: In a loyalty program analysis, if customers who enrolled all have very high propensity scores while non-enrollees have very low scores, it means enrolled customers were systematically different (e.g., higher prior spend, more visits). Finding a "fair comparison" will be difficult.
What to do if overlap is poor: (1) Use fewer or different covariates, (2) collect more data, or (3) consider that the treatment effect may not be estimable with this data—the groups are too different to compare.
Covariate Balance: Before vs. After Matching
Interpretation Aid
Technical Interpretation
Standardized Mean Difference (SMD): Each point shows how different treated and control groups are on a covariate, measured in pooled standard deviation units. An SMD of 0.2 means the groups differ by 0.2 standard deviations on that variable.
Balance thresholds: SMD < 0.1 indicates excellent balance (green zone). SMD 0.1–0.25 is acceptable but shows residual imbalance (yellow). SMD > 0.25 suggests meaningful differences that may bias the ATT estimate (red).
Before vs. After: Open circles show pre-matching imbalance; filled circles show post-matching balance. Successful matching moves all points toward zero (the center dashed line).
Practical Interpretation
What it tells you: This "Love Plot" shows whether matching created comparable groups. If all filled circles (after matching) are within the ±0.1 bands, you can be confident that treated and matched control groups are similar on observed characteristics.
Marketing example: If "Prior_Spend" has SMD = 0.5 before matching (treated spent more) but SMD = 0.05 after matching, the matched comparison controls for prior spending behavior. The ATT now compares customers with similar spending histories.
Action: If any covariate remains imbalanced (SMD > 0.25) after matching, consider adding it to the outcome model or interpreting results cautiously—remaining confounding may bias the effect estimate.
MATCHING SUMMARY
Treated Units: --
Control Units: --
Matched Pairs: --
Unmatched Treated: --
Mean PS (Treated): --
Mean PS (Matched Control): --
Understanding Matching Quality
Technical Interpretation
Matched pairs: The number of treated units successfully matched to control units within the caliper threshold. This determines the effective sample size for ATT estimation.
Unmatched treated: Treated units outside the "common support" region—they have propensity scores too extreme to find comparable controls. High unmatched rates (>20%) suggest limited overlap.
Mean propensity scores: After matching, mean PS should be nearly identical between treated and matched controls. Differences > 0.05 suggest residual imbalance.
Practical Interpretation
Sample size trade-off: PSM often discards observations (unmatched treated or unused controls). A match rate of 80%+ is typically good; below 50% suggests the groups may be too different for meaningful comparison.
Who gets excluded: Unmatched treated units are often "extreme" cases—e.g., customers with very high engagement who enrolled in a program but have no comparable non-enrollees. The ATT estimate applies only to the matched subset, which may differ from all treated units.
Generalizability: If many treated units are excluded, ask: "Can I generalize the treatment effect to the excluded cases?" If they're systematically different, the answer may be no.
Technical Interpretation
What SMD measures: The difference in means between treated and control groups, divided by the pooled standard deviation: SMD = (Mean_Treated - Mean_Control) / SD_pooled. This makes the measure unit-free and comparable across variables.
For categorical variables: Each category level is treated as a 0/1 indicator, and SMD is calculated for the proportion in each level.
% Reduction: Shows how much matching improved balance. 80%+ reduction is excellent; negative reduction means matching made things worse (rare, but possible with replacement).
Practical Interpretation
Reading the table: Compare the "SMD Before" and "SMD After" columns. Ideally, all "SMD After" values should be near zero (green). Covariates with high SMD after matching may still confound your treatment effect estimate.
Example: If "Prior_Spend" has SMD = 0.4 before matching, treated customers spent 0.4 SDs more than controls before the program. After matching (SMD = 0.05), the difference is negligible—you're now comparing customers with similar spending histories.
If balance is poor: Try adjusting matching settings (wider caliper, different covariates) or consider that selection bias may be too strong to overcome with this data.
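The SMD definition above translates directly into code. A minimal sketch (assuming only numpy; `x` is one covariate column and `treated` the 0/1 indicator):

```python
import numpy as np

def smd(x, treated):
    """Standardized mean difference:
    (mean_treated - mean_control) / pooled SD."""
    x_t, x_c = x[treated == 1], x[treated == 0]
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / pooled_sd
```

Computing this on the full sample gives "SMD Before"; recomputing on only the matched units gives "SMD After".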
TREATMENT EFFECT RESULTS
Average Treatment Effect on the Treated (ATT)
Estimated ATT: --
Standard Error: --
95% CI: --
t-statistic: --
p-value: --
Mean Outcome (Treated): --
Mean Outcome (Matched Control): --
Interpreting the ATT
Technical Interpretation
Average Treatment Effect on the Treated (ATT): The mean difference in outcomes between treated units and their matched controls: ATT = (1/n) Σ(Y_treated - Y_matched_control). This estimates the causal effect of treatment for those who received it.
Standard Error & 95% CI: The SE measures uncertainty in the ATT estimate. The 95% CI gives the range of effect sizes consistent with the data; under repeated sampling, intervals constructed this way would contain the true effect 95% of the time. If the CI excludes zero, the effect is statistically significant at α = 0.05.
t-statistic & p-value: The t-stat is ATT/SE; p-value tests H₀: ATT = 0. p < 0.05 provides evidence of a non-zero treatment effect.
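One common way to compute these quantities treats the matched-pair differences like a paired sample. This is a sketch of that simple approach; note it ignores the extra uncertainty from estimating the propensity score itself, which more elaborate variance estimators try to account for.

```python
import numpy as np
from scipy import stats

def att_inference(y_treated, y_control, alpha=0.05):
    """ATT, SE, CI, t-stat, and p-value from matched-pair differences."""
    d = np.asarray(y_treated, float) - np.asarray(y_control, float)
    att = d.mean()
    se = d.std(ddof=1) / np.sqrt(len(d))          # SE of the mean difference
    t_stat = att / se
    p = 2 * stats.t.sf(abs(t_stat), df=len(d) - 1)  # two-sided test of ATT = 0
    crit = stats.t.ppf(1 - alpha / 2, df=len(d) - 1)
    return att, se, (att - crit * se, att + crit * se), t_stat, p
```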
Practical Interpretation
What the number means: If ATT = 120 for a spending outcome, treated customers spent an average of $120 more than similar untreated customers. This is the estimated causal impact of the treatment.
Business decision-making: Compare the ATT to the cost of treatment. If a loyalty program costs $50/customer to run but generates ATT = $120 in additional spend, the program has positive ROI.
Caution: The ATT is valid only if (1) the propensity model includes all confounders, (2) matching achieved good balance, and (3) there are no unmeasured confounders. PSM cannot prove causation—it can only adjust for observed differences.
Effect size vs. significance: A large p-value with a meaningful ATT suggests insufficient sample size. A tiny p-value with a tiny ATT may be statistically significant but practically irrelevant. Always consider both.
Statistical Interpretation
Run matching analysis to see results.
Managerial Takeaway
Run matching analysis to see actionable insights.
The downloaded file includes the matched sample with propensity scores, match IDs, and outcomes.
METHODS COMPARISON: WHAT DID MATCHING BUY US?
Propensity score matching is most compelling when compared against simpler alternatives. The chart below shows three estimates of the treatment effect from the same data — revealing how much selection bias the naive comparison contains, and whether parametric OLS adjustment or PSM better handles it.
Run matching analysis to see method comparison.
Understanding the Method Comparison
Why Three Methods?
Each approach makes different assumptions about selection bias and uses the data differently. Comparing them exposes how much of the raw effect was real vs. pre-existing group differences.
Naive Mean Difference: Simply compares treated vs. control group averages with no adjustment. Almost always the largest estimate — it conflates the treatment effect with pre-existing differences between who self-selected into treatment.
OLS with Covariates: Fits a linear regression with treatment + covariates as predictors. Adjusts for confounders parametrically (assumes linearity) and uses the full sample — even control units far outside the treated group's range. A sensible baseline but can extrapolate in ways that are hard to audit.
PSM ATT: Restricts the comparison to matched pairs only, enforcing the common support assumption. More transparent about which comparisons are being made. It explicitly discards unmatched units rather than extrapolating over them — at the cost of a smaller effective sample.
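The three estimates can be compared on simulated data where the truth is known. This sketch plants a true effect of $100 with deliberate self-selection on a single hypothetical confounder (`prior_spend`); the naive gap comes out inflated, while covariate adjustment and matching both recover something close to the truth. It uses only numpy and scikit-learn and is illustrative, not the tool's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 4000

# Simulated customers: true program effect is $100, but high prior
# spenders are more likely to enroll (all names hypothetical).
prior_spend = rng.normal(1200, 300, n)
treated = rng.binomial(1, 1 / (1 + np.exp(-(prior_spend - 1200) / 150)))
outcome = 0.8 * prior_spend + 100 * treated + rng.normal(0, 50, n)

# 1) Naive mean difference: inflated by selection on prior_spend.
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()

# 2) OLS with covariates: the treatment coefficient in a linear model.
ols = LinearRegression().fit(np.column_stack([treated, prior_spend]), outcome)
ols_effect = ols.coef_[0]

# 3) PSM ATT: greedy 1:1 caliper matching on the estimated score,
#    then the mean within-pair outcome difference.
z = ((prior_spend - prior_spend.mean()) / prior_spend.std())[:, None]
ps = LogisticRegression().fit(z, treated).predict_proba(z)[:, 1]
controls, diffs = list(np.flatnonzero(treated == 0)), []
for t in np.flatnonzero(treated == 1):
    if not controls:
        break
    d = np.abs(ps[np.array(controls)] - ps[t])
    j = int(np.argmin(d))
    if d[j] <= 0.25:
        diffs.append(outcome[t] - outcome[controls.pop(j)])
psm_att = float(np.mean(diffs))
```

Here OLS is correctly specified by construction, so it does well; with nonlinear confounding or poor overlap, the OLS and PSM estimates would diverge in exactly the ways described below.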
Reading the Forest Plot
Each row shows one method's point estimate (diamond) with 95% confidence interval (horizontal whiskers). The dashed vertical line at zero is the null (no effect). A CI that crosses zero means that method's estimate is not statistically significant at α = .05.
Interpreting the Pattern
If all three are similar: Selection bias was modest — the raw and adjusted estimates agree. Treatment effect is robust to method choice.
If PSM is much lower than naive: Selection bias was large. Treated units had favorable characteristics before treatment; the raw gap overstated the causal effect.
If OLS ≈ PSM but both < naive: The confounders you included capture most of the selection mechanism; both adjustment approaches agree.
If OLS and PSM diverge substantially: The linear OLS model may be misspecified, or treatment effects are heterogeneous across the propensity score distribution. PSM, by restricting to common support, is generally more defensible in this case.
PROPENSITY MODEL DETAILS
View Propensity Score Model Coefficients
The propensity score model predicts treatment assignment from covariates using logistic regression.