k-Means Clustering Explorer

Segmentation tool · Exploratory

Which customers behave alike — and what makes each group distinct? Upload numeric marketing data (CRM metrics, campaign responses, survey scores), choose how many segments to look for, and discover natural clusters that can drive targeting, positioning, and resource allocation.

πŸ‘¨β€πŸ« Professor Mode: Guided Learning Experience

New to cluster analysis? Enable Professor Mode for step-by-step guidance through segmenting your customers with K-Means clustering!

OVERVIEW & KEY CONCEPTS

Customer segmentation is foundational to marketing strategy — it determines how budgets are allocated, which messages reach which audiences, and whether a campaign feels personalized or generic. K-means clustering is one of the most widely used algorithms for this task: it partitions observations into k groups so that customers within the same cluster are as similar as possible while clusters themselves are as distinct as possible. The result is a data-driven segmentation that replaces intuition with structure.

Formally, k-means minimizes the total within-cluster sum of squared distances:

$$\min_{\{C_j,\ \mu_j\}} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where \(C_j\) is the set of observations assigned to cluster \(j\) and \(\mu_j\) is that cluster’s centroid (mean vector).
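As a concrete illustration, the objective can be computed directly from the data, the assignments, and the centroids. This is a minimal NumPy sketch with made-up data, not the tool's implementation:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Total within-cluster sum of squared distances (the k-means objective)."""
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    )

# Four illustrative observations forming two tight pairs
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])
# Each centroid is the mean of its cluster's observations
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
total = wcss(X, labels, centroids)
```

The algorithm searches over assignments and centroids to make this total as small as possible for the chosen k.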

Key Concepts
  • Centroids: Each cluster is summarized by its centroid — the mean of all observations in that cluster across every feature. The algorithm iterates between assigning observations to the nearest centroid and recomputing centroids until assignments stabilize.
  • Feature scaling: K-means uses Euclidean distance, so features on larger scales dominate the clustering. Standardizing to z-scores (mean 0, sd 1) or rescaling to a common range ensures all variables contribute proportionally.
  • Choosing k: There is no single “correct” number of clusters — it is a judgment call informed by diagnostics (Elbow plot, Silhouette score) and business interpretability. A 4-cluster solution that maps to actionable personas may be more useful than a 7-cluster solution with marginally better fit.
  • Initialization sensitivity: Results can vary with random starting positions. This tool runs multiple initializations and keeps the best solution, but results are exploratory — not deterministic ground truth.
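The practices above (standardize first, run several random starts, keep the best fit) can be sketched end to end. This is a NumPy-only illustration of Lloyd's algorithm with made-up data, not the tool's actual implementation:

```python
import numpy as np

def kmeans(X, k, n_init=10, max_iter=100, seed=0):
    """Run Lloyd's algorithm from several random starts; keep lowest-WCSS fit."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Random initial centroids drawn from the data
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: nearest centroid by Euclidean distance
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update step: recompute each centroid as its cluster's mean
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        total = ((X - centroids[labels]) ** 2).sum()
        if best is None or total < best[0]:
            best = (total, labels, centroids)
    return best

# Two features on very different scales (e.g., revenue vs. a 1-5 score)
X = np.array([[100.0, 1.0], [110.0, 1.2], [500.0, 5.0], [520.0, 5.2]])
# Standardize to z-scores so both features contribute proportionally
Z = (X - X.mean(axis=0)) / X.std(axis=0)
total, labels, centroids = kmeans(Z, k=2)
```

Without the standardization step, the revenue column would dominate the distance calculation and the score column would barely matter.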
When to Use K-Means Clustering

K-means is well-suited to any setting where you have numeric measurements across a set of entities and want to discover natural groupings. Common marketing applications include:

  • Customer segmentation from CRM data — group customers by recency, frequency, monetary value (RFM), or behavioral and attitudinal metrics.
  • Campaign response grouping — identify which customers reacted similarly to a promotion, email cadence, or price change.
  • Product portfolio clustering — discover which products or SKUs compete in the same perceptual or performance space.
  • Survey-based persona development — group respondents by preference patterns, attitudes, or lifestyle indicators.
  • Market structure analysis — identify natural groupings in competitive or category-level data to inform positioning strategy.

When to consider alternatives: If your data is primarily categorical, very high-dimensional without prior reduction, or you suspect non-spherical cluster shapes, methods like hierarchical clustering, DBSCAN, or latent class analysis may be better suited.

DATA SOURCE

πŸ“š Use a Case Study

Use presets to load example segmentation datasets (such as customers with RFM metrics, email campaign performance, or product portfolio data). You can download the scenario data to tweak it in Excel or Numbers and then re-upload.

πŸ“€ Upload Your Data

Upload numeric tabular data (CSV, TSV, or Excel)

Provide a header row and numeric columns only (up to 5,000 observations). After upload you can choose which variables to include in the clustering and which to plot on the chart.

Drag & drop a data file (.csv, .tsv, .txt, .xls, .xlsx).

INPUTS & SETTINGS

Feature selection

Choose which numeric columns are used to form clusters.

Preprocessing & clustering

Set the standardization option and the range of cluster counts (k) to test, from a minimum to a maximum value.

Additional info & guidance

Start with a small range of clusters (for example, 2–8) and look at the elbow and silhouette charts to decide on a reasonable value of k. Extremely large k can overfit noise and create tiny clusters that are hard to interpret.

You can change the standardization option to see how feature scaling impacts cluster assignments, which is especially important when some variables are on very different scales (for example, annual revenue vs. satisfaction scores).
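The workflow this guidance describes can be sketched as follows, assuming scikit-learn is available (the tool's actual backend is unspecified). The synthetic data, the 2–8 range, and all parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three synthetic, well-separated customer segments in two features
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
               for loc in ([0, 0], [4, 0], [2, 4])])
Z = StandardScaler().fit_transform(X)  # z-score standardization

# Fit k-means for each k in the candidate range and record diagnostics
results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    results[k] = (km.inertia_,                    # WCSS, for the elbow chart
                  silhouette_score(Z, km.labels_))  # average silhouette

for k, (wcss, sil) in results.items():
    print(f"k={k}: WCSS={wcss:.1f}, silhouette={sil:.2f}")
```

WCSS always decreases as k grows, so look for the bend rather than the minimum; the average silhouette typically peaks at a k that matches the data's actual structure.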

VISUAL OUTPUT

Cluster map (scatterplot)

Each point is an observation colored by cluster membership, plotted using the selected x and y variables. Centroids are shown as larger markers. The underlying clustering uses all selected features (after any chosen scaling), but the axes on this chart always show the variables in their original units. This means the map is a 2D or 3D projection: clusters that overlap visually on these axes may still differ on other variables included in the model.

Elbow chart (within-cluster variation)

Plots total within-cluster sum of squares (WCSS) versus k. Look for an “elbow” where additional clusters give diminishing returns.

Advanced visualization settings

Subsampling can make dense plots easier to read when you have many observations. When you show only a random subset of points, the clusters and centroids are still based on the full dataset.
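A minimal sketch of that subsampling logic, with illustrative sizes and stand-in labels (the clustering itself is assumed to have already run on every row):

```python
import numpy as np

rng = np.random.default_rng(0)
n, plot_n = 5000, 500
X = rng.normal(size=(n, 2))
labels = rng.integers(0, 4, size=n)  # stand-in for full-data cluster labels

# Choose which rows to draw; labels and centroids still come from all n rows
idx = rng.choice(n, size=plot_n, replace=False)
X_plot, labels_plot = X[idx], labels[idx]
# Pass X_plot / labels_plot to the scatter call; centroids use the full X
```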

Silhouette diagnostics

Shows the average silhouette score for the current choice of k. Silhouette values range from -1 to 1: values near 1 indicate observations are much closer to their own cluster than to others (clear, well-separated segments), values near 0 indicate overlapping or ambiguous boundaries, and negative values suggest some observations fit better in a different cluster.
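To make the -1 to 1 range concrete, the per-observation silhouette can be computed directly. This NumPy-only sketch uses a tiny illustrative dataset, not the tool's implementation:

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-observation silhouette: (b - a) / max(a, b)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False
        if not own.any():          # singleton cluster: silhouette defined as 0
            continue
        a = D[i, own].mean()       # mean distance to own cluster
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})  # nearest other cluster
        s[i] = (b - a) / max(a, b)
    return s

# Two tight, well-separated clusters -> silhouettes near 1
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
```

Averaging `s` gives the score reported here; inspecting individual values shows exactly which observations sit near a boundary (near 0) or fit another cluster better (negative).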

CLUSTER SUMMARY

Number of clusters (k):
Total within-cluster SS:
Average silhouette (k):
Largest cluster size:

APA-Style Report

After you run k-means, this panel will describe the clustering solution in a more formal statistical style, including the number of clusters, within-cluster variation, and basic diagnostics.

Managerial Interpretation

This panel translates the clustering into plain language, highlighting how many customer or campaign segments were found, how they differ on key variables, and how actionable the segments appear.

Cluster Profile Views

These tabs show the same cluster solution through different lenses: exact values, shape, contrast, over/under-indexing, and overlap. Start with the profile table, then switch views to see what becomes easier to spot visually.

Use the table when you want exact values for size, average feature levels, within-cluster variability, and overall tightness.

Columns: Cluster · Size · Feature means · Within-cluster standard deviations · Avg distance to centroid
Run clustering to see cluster profiles here.

The downloaded file includes your original numeric columns plus a cluster_id for each observation and its distance_to_centroid (how far it sits from the center of its assigned cluster based on the selected features).
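How those two extra columns can be derived from a finished run is sketched below. The feature names and file name are illustrative; `cluster_id` and `distance_to_centroid` match the export described above:

```python
import csv
import numpy as np

# Stand-ins for a finished run: selected features and cluster assignments
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.0]])
labels = np.array([0, 0, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# Euclidean distance from each observation to its assigned centroid
dist = np.linalg.norm(X - centroids[labels], axis=1)

with open("clusters.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["feature_1", "feature_2", "cluster_id", "distance_to_centroid"])
    for row, cid, d in zip(X, labels, dist):
        w.writerow([*row, int(cid), float(d)])
```

Large `distance_to_centroid` values mark atypical members of a segment and are a quick way to spot outliers in the exported file.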

DIAGNOSTICS & ASSUMPTIONS

The tool will flag potential issues such as very small clusters, extreme outliers, or highly imbalanced cluster sizes. Use these diagnostics to decide whether to simplify the feature set, change k, or reconsider using k-means for this dataset.
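Checks of this kind can be sketched as follows. The 5% size floor and the 10Γ— imbalance ratio are illustrative assumptions, not the tool's documented thresholds:

```python
import numpy as np

labels = np.array([0] * 90 + [1] * 8 + [2] * 2)  # stand-in assignments
sizes = np.bincount(labels)
n = labels.size

flags = []
for cid, size in enumerate(sizes):
    if size < 0.05 * n:                        # very small cluster
        flags.append(f"cluster {cid} has only {size} observations")
if sizes.max() / max(sizes.min(), 1) > 10:     # highly imbalanced sizes
    flags.append("cluster sizes are highly imbalanced")
```

A tiny or dominant cluster is often a sign of an outlier-driven split or a k that is too large; both are cues to revisit the feature set or the cluster count.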