MAE Model Calibration Lab

Loss Functions & Regression Foundations

Manually calibrate linear and quadratic response models by minimizing Mean Absolute Error. Develop intuition for how loss functions guide model fitting before letting algorithms do the work.

👨‍🏫 Professor Mode: Guided Learning Experience

New to loss functions? Enable Professor Mode for step-by-step guidance through understanding how regression models learn!

OVERVIEW & LEARNING OBJECTIVES

Mean Absolute Error (MAE) is a loss function that measures the average absolute difference between predicted and actual values. By manually adjusting model parameters to minimize MAE, you'll understand what optimization algorithms do automatically—and develop intuition for why different model shapes fit different data patterns.

🎯 What You'll Learn
  • Loss functions in action: Watch how MAE changes as you adjust parameters. Lower MAE = better predictions!
  • Linear vs. Quadratic: Learn when a simple linear model is sufficient and when you need the flexibility of a quadratic to capture curvature.
  • Marketing applications: Explore three different marketing scenarios with distinctly different data patterns to see how model choice matters in practice.
  • Benchmark your intuition: Compare your manually-fitted MAE against the algorithmically optimal MAE to see how close you can get!

💡 Why This Matters: Every regression, machine learning model, and optimization algorithm uses a loss function to "learn" the best parameters. Understanding MAE builds the foundation for marketing mix modeling, attribution, and predictive analytics.

📐 Mathematical Foundations

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| = \frac{1}{N}\sum_{i=1}^{N}\left|\text{Actual} - \text{Predicted}\right|$$
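
In code, MAE is just a few lines. A minimal pure-Python sketch (the data values below are made up for illustration):

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted| over all points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy example: three actual values vs. three predictions
print(mae([10, 20, 30], [12, 18, 33]))  # (2 + 2 + 3) / 3 ≈ 2.33
```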

Linear Model:

$$\hat{y} = B_0 + B_1 \cdot X$$

Quadratic Model:

$$\hat{y} = B_0 + B_1 \cdot X + B_2 \cdot X^2$$
| Parameter | Name | Interpretation |
|---|---|---|
| B₀ | Intercept | Predicted Y when X = 0 (baseline) |
| B₁ | Linear coefficient | Change in Y per unit increase in X (slope/rate) |
| B₂ | Quadratic coefficient | Curvature: positive = accelerating, negative = diminishing returns |
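
The two prediction functions above translate directly into code. A sketch using the lab's default parameter values (AdSpending = 100 is an illustrative input, not part of the dataset):

```python
def predict_linear(x, b0, b1):
    """Linear model: y_hat = B0 + B1 * X"""
    return b0 + b1 * x

def predict_quadratic(x, b0, b1, b2):
    """Quadratic model: y_hat = B0 + B1 * X + B2 * X^2"""
    return b0 + b1 * x + b2 * x ** 2

# Default parameters from the lab: B0 = 20.0, B1 = 0.50, B2 = 0.0001
print(predict_linear(100, 20.0, 0.50))             # 70.0
print(predict_quadratic(100, 20.0, 0.50, 0.0001))  # 71.0
```

Note how the quadratic term contributes only 0.0001 × 100² = 1.0 here; B₂ values are typically tiny because X² grows so quickly.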

MARKETING SCENARIOS

Select a marketing scenario above to see the business context and variables, or use the default Search Ads dataset loaded below.

HOW TO USE THIS TOOL

🎚️ Adjust Parameters

Use the sliders for coarse adjustments or type precise values in the number inputs.

👀 Watch the MAE

The MAE updates in real time. Your goal is to minimize it—lower is better!

📏 Read the Error Lines

Grey dotted lines show the absolute error for each data point. Shorter lines = better fit.

🔄 Compare Models

Try both linear and quadratic. Which achieves a lower MAE? What does that tell you?

LINEAR MODEL

🔵 Linear Fit: Y = B₀ + B₁ × X

Model: REVENUE = 20.0 + 0.50 × AdSpending
MAE = --
💡 Interpreting Your Linear Model

QUADRATIC MODEL

🟠 Quadratic Fit: Y = B₀ + B₁ × X + B₂ × X²

Model: REVENUE = 20.0 + 0.50 × AdSpending + 0.000100 × AdSpending²
MAE = --
💡 Interpreting Your Quadratic Model

MODEL COMPARISON

Compare your manually-fitted models against the algorithmically optimal fit. How close can you get?

🔵 Linear Model

Your MAE: --
Optimal MAE: --
Gap: --

🟠 Quadratic Model

Your MAE: --
Optimal MAE: --
Gap: --
Adjust the parameters above to see which model fits better!
📊 How is "Optimal MAE" calculated?

The optimal MAE is found using L1 regression (Least Absolute Deviation), which directly minimizes Mean Absolute Error. Unlike ordinary least squares (OLS) which minimizes squared error, L1 regression finds the parameters that truly minimize the sum of absolute errors.

Since there's no closed-form solution for L1 regression, we use the Nelder-Mead simplex optimization algorithm—an iterative method that searches for the parameter values that minimize MAE without requiring calculus.
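
As a sketch of that procedure, here is how SciPy's Nelder-Mead optimizer can minimize MAE for a linear model. The synthetic data and starting values are assumptions for illustration, not the lab's actual dataset:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical noisy linear data: true B0 = 20.0, true B1 = 0.5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 20.0 + 0.5 * x + rng.laplace(scale=1.0, size=x.size)

def mae_loss(params):
    """MAE of a linear model y_hat = B0 + B1 * x for the given parameters."""
    b0, b1 = params
    return np.mean(np.abs(y - (b0 + b1 * x)))

# Nelder-Mead searches parameter space without derivatives,
# which is why it works for the non-differentiable absolute value.
result = minimize(mae_loss, x0=[np.median(y), 0.0], method="Nelder-Mead")
b0_opt, b1_opt = result.x
print(f"optimal B0={b0_opt:.2f}, B1={b1_opt:.2f}, MAE={result.fun:.3f}")
```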

If your manual MAE is within 5-10% of optimal, you've done an excellent job! Getting exactly to optimal by hand is very difficult.

EXPLORATORY QUESTIONS

🤔 Think About These Questions
  1. What Were You Actually Doing? As you moved the sliders, you were searching for parameters that minimize error. That's it. That's what "fitting a model" means—whether you do it by hand or let an algorithm do it. What did this process feel like? Tedious? Satisfying when MAE dropped?
  2. Simpler is Better (When It Works): Try the Search Ads scenario. The linear model (2 parameters) probably fits almost as well as the quadratic (3 parameters). If the MAE difference is tiny, which model would you choose and why?
  3. When Simplicity Fails: Now try the Email Frequency scenario. Can the linear model capture the "sweet spot" where revenue peaks? What happens to the linear model's predictions at the extremes (very few or very many emails)?
  4. The Gap to Optimal: Compare your manually-fitted MAE to the algorithmically optimal MAE shown in the Model Comparison section. How close did you get? What does this tell you about why we let computers do this in practice?
  5. The Big Picture: Every regression, every machine learning model, every neural network is fundamentally doing what you just did: adjusting parameters to minimize some measure of error. The math gets fancier, but the core idea doesn't change. Does that make these methods feel more approachable or more mysterious?
📚 Connecting to Broader Concepts
🎯 What You Just Did = What Algorithms Do

When you run lm() in R or LinearRegression() in Python, the computer is doing exactly what you did—searching for parameters that minimize error. It just does it faster and finds the exact optimum using calculus.
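
For instance, NumPy's `polyfit` solves the least-squares version of this search in closed form, just as `lm()` and `LinearRegression()` do. The small dataset below is invented for illustration:

```python
import numpy as np

# Five hypothetical (X, Y) observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([21.0, 21.4, 22.1, 22.4, 23.0])

# polyfit with deg=1 finds the B0 and B1 that minimize squared error,
# in one step, with no manual slider-tuning required
b1, b0 = np.polyfit(x, y, deg=1)
print(f"Y = {b0:.2f} + {b1:.2f} * X")  # Y = 20.48 + 0.50 * X
```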

⚖️ Parsimony: The Preference for Simplicity

Scientists and analysts prefer simpler models when they fit the data equally well. Why? Simpler models are easier to interpret, less likely to overfit, and more likely to work on new data. This principle has a name: Occam's Razor.

📉 Loss Functions: The Universal Language

MAE is one "loss function." MSE (mean squared error) is another. Log-loss is used for classification. But they all serve the same purpose: giving the algorithm a single number to minimize. Pick your loss function, then minimize it.
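
A quick illustration of why the choice of loss function matters: with one large error in the mix, MSE penalizes it far more heavily than MAE (the error values here are made up):

```python
# Four prediction errors; the last one is an outlier
errors = [1.0, 1.0, 1.0, 10.0]

mae = sum(abs(e) for e in errors) / len(errors)  # (1 + 1 + 1 + 10) / 4 = 3.25
mse = sum(e ** 2 for e in errors) / len(errors)  # (1 + 1 + 1 + 100) / 4 = 25.75
print(mae, mse)
```

Squaring makes the outlier dominate MSE, so an MSE-minimizing fit bends toward outliers while an MAE-minimizing fit is more robust to them.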

🚀 From Here to Machine Learning

Neural networks have millions of parameters instead of 2-3, but they're still just minimizing a loss function. The difference is scale and the types of patterns they can capture—not the fundamental logic of what "learning" means.