Theme Extractor
Automatically discover hidden themes and topics in customer reviews, survey responses, and open-ended feedback using machine learning.
HOW IT WORKS
This tool uses Non-negative Matrix Factorization (NMF) to identify latent themes in text data. Unlike manual coding, the algorithm automatically discovers patterns across hundreds of documents.
Text Preprocessing
Documents are cleaned, tokenized, and lemmatized. Stop words and common filler terms are removed.
TF-IDF Vectorization
Text is converted to numerical vectors based on term frequency and document importance.
Theme Discovery
NMF decomposes the matrix into themes, each represented by its most characteristic words.
Document Assignment
Each document is assigned to themes based on how strongly it matches each theme's pattern.
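The four steps above can be sketched with scikit-learn. This is a minimal illustration on toy reviews; the tool's own pipeline (lemmatization, filler-term filtering, auto-detected theme count) may differ in details:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "battery life is great and it charges fast",
    "battery drains quickly, poor battery life",
    "fast shipping, arrived in two days",
    "shipping was slow and the box was damaged",
]

# Steps 1-2: tokenize, drop stop words, build the TF-IDF matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # documents x terms

# Step 3: factor into document-theme (W) and theme-term (H) matrices.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# Step 4: assign each document to its strongest theme.
assignments = W.argmax(axis=1)
```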
When to Use This Tool
- Customer Reviews: Discover what aspects of your product customers discuss most
- Survey Responses: Find patterns in open-ended feedback without manual coding
- Social Media: Identify conversation themes around your brand
- Support Tickets: Categorize issues automatically for better triage
Understanding NMF Topic Modeling
What is NMF?
Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique that decomposes your document-term matrix into two smaller matrices:
- W (Document-Theme matrix): Shows how much each document relates to each theme
- H (Theme-Term matrix): Shows which words define each theme
The "non-negative" constraint means all values are ≥ 0, making results interpretable as additive parts that combine to form documents.
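A toy numpy example (hand-picked numbers, purely illustrative rather than real NMF output) shows how the two non-negative factors combine additively:

```python
import numpy as np

# Hypothetical factors over a 5-word vocabulary,
# e.g. ["battery", "charge", "life", "shipping", "box"].
H = np.array([[0.9, 0.8, 0.6, 0.0, 0.1],   # theme 0: battery words
              [0.0, 0.1, 0.2, 0.9, 0.7]])  # theme 1: shipping words
W = np.array([[1.0, 0.0],    # doc 0: purely theme 0
              [0.2, 0.8],    # doc 1: mostly theme 1
              [0.5, 0.5]])   # doc 2: an even mix

X = W @ H   # reconstructed document-term matrix

# Each document row is an additive, non-negative blend of theme rows:
assert np.allclose(X[1], 0.2 * H[0] + 0.8 * H[1])
```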
NMF vs. Other Methods
| Method | Strengths | Best For |
|---|---|---|
| NMF | Interpretable themes, fast, produces sparse topics | Marketing text, customer reviews, product feedback |
| LDA | Probabilistic, handles document length well | Large document collections, academic papers |
| LSA/SVD | Captures synonymy, mathematical foundations | Information retrieval, search engines |
Limitations to Keep in Mind
- Bag of words: Word order is ignored; "not good" and "good" are treated similarly
- Pre-set theme count: You must specify (or auto-detect) the number of themes
- Static themes: Themes don't evolve over time or account for sequence
- Interpretability depends on data: Noisy or very diverse text may produce unclear themes
Tips for Better Results
- Homogeneous data: Analyze reviews of similar products together, not mixed categories
- Sufficient volume: 50+ documents recommended for meaningful themes
- Similar length: Very short tweets mixed with long reviews may skew results
- Domain-specific stop words: Common domain terms (e.g., "product," "service") may need filtering
Understanding TF-IDF Vectorization
What is TF-IDF?
Term Frequency-Inverse Document Frequency measures how important a word is to a document in a collection:
- TF (Term Frequency): How often a word appears in a document (more = more important)
- IDF (Inverse Document Frequency): How rare a word is across all documents (rarer = more distinctive)
Formula: TF-IDF = TF Γ log(Total Documents / Documents with Term)
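The formula can be computed directly in plain Python (toy documents and raw counts; production vectorizers typically add smoothing and length normalization on top):

```python
import math

docs = [
    ["good", "battery", "life"],
    ["battery", "died", "fast"],
    ["fast", "shipping", "today"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / df)            # inverse document frequency
    return tf * idf

# "shipping" (1 of 3 docs) scores higher than "battery" (2 of 3 docs):
assert tf_idf("shipping", docs[2], docs) > tf_idf("battery", docs[1], docs)
```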
Why TF-IDF Matters
Common words like "the," "is," "product" appear everywhere and don't help distinguish themes. TF-IDF automatically downweights these while highlighting distinctive vocabulary:
- High TF-IDF: Words that are frequent in specific documents but rare overall → good theme indicators
- Low TF-IDF: Words that appear in nearly every document, or almost never → weak theme indicators
Example
In electronics reviews:
- "battery" might have high TF-IDF in power-related reviews β strong theme signal
- "product" appears in almost every review β low TF-IDF, filtered out
- "phenomenal" rarely appears but is strong when it does β distinctive sentiment indicator
DATA SOURCE
Use a Case Study
Pre-loaded Text Datasets
Select a preset scenario to explore real-world text data with theme extraction. Each scenario includes sample reviews, comments, or survey responses.
Enter Your Data
Max 500 documents; max 5,000 characters each.
Paste your text data below. Each line will be treated as a separate document (review, comment, etc.).
Upload a CSV file with a text column. The tool will auto-detect columns containing text.
Drag & Drop CSV file (.csv, .tsv, .txt, .xls, .xlsx)
ANALYSIS SETTINGS
Number of Themes: Auto-detect finds the optimal number based on your data.
Words per Theme: More words give richer theme descriptions.
Choosing the Right Number of Themes
Auto-Detection Method
When set to "Auto-detect," the algorithm tests different theme counts (2-8) and selects the one with the best coherence score. Coherence measures how semantically similar the top words in each theme are to each other.
- High coherence: Theme words relate to a single concept (e.g., "battery, charge, power, life")
- Low coherence: Theme words are scattered (e.g., "battery, shipping, color, return")
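The search over theme counts can be sketched as follows. This uses a simplified coherence proxy (mean pairwise cosine similarity between the top words' document-occurrence profiles); the tool's actual coherence measure may differ:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "battery life is great, charges fast",
    "poor battery life, battery drains overnight",
    "battery barely lasts a day after charging",
    "shipping was fast and the package arrived early",
    "slow shipping, the package was delayed a week",
    "shipping box arrived damaged and late",
    "screen is bright with vivid colors",
    "screen cracked and the display flickers",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

def coherence(H, X, top_n=4):
    # Proxy: how similarly each theme's top words are distributed
    # across documents (mean pairwise cosine of their term columns).
    scores = []
    for row in H:
        top = row.argsort()[::-1][:top_n]
        sim = cosine_similarity(X[:, top].T)
        scores.append(sim[np.triu_indices(top_n, k=1)].mean())
    return float(np.mean(scores))

# Try candidate theme counts and keep the most coherent one.
best_k, best_score = None, -1.0
for k in range(2, 5):
    nmf = NMF(n_components=k, init="nndsvd", random_state=0)
    nmf.fit(X)
    score = coherence(nmf.components_, X)
    if score > best_score:
        best_k, best_score = k, score
```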
When to Override Auto-Detection
| Situation | Recommended Action |
|---|---|
| Themes seem too broad/general | Increase theme count (more specific topics) |
| Similar themes detected | Decrease theme count (consolidate) |
| You know the domain well | Set count based on expected categories |
| Very small dataset (<100 docs) | Use 2-4 themes to avoid overfitting |
| Large, diverse dataset (>500 docs) | Try 5-8 themes for granularity |
Iterative Approach
Topic modeling often benefits from experimentation:
- Start with auto-detect to get a baseline
- Review the themes: do they make business sense?
- If themes overlap, reduce the count; if themes are too broad, increase it
- Compare sample comments across runs to validate
Words per Theme: Quality vs. Specificity
What This Controls
This setting determines how many top-weighted words are shown for each discovered theme. The words are ranked by their importance to that theme (NMF weight).
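Given NMF's theme-term matrix H, extracting each theme's top words is a simple ranking. A sketch with hypothetical weights:

```python
import numpy as np

# Hypothetical theme-term matrix H (2 themes x 6-word vocabulary).
vocab = ["battery", "charge", "life", "shipping", "box", "delivery"]
H = np.array([[0.9, 0.7, 0.6, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.8, 0.5, 0.7]])

def top_words(H, vocab, n=3):
    # For each theme, rank words by NMF weight and keep the top n.
    return [[vocab[i] for i in row.argsort()[::-1][:n]] for row in H]

labels = top_words(H, vocab)
# labels[0] -> ['battery', 'charge', 'life']
```

Raising `n` simply extends each ranked list with progressively lower-weight words, which is why large values can pull in lower-relevance terms.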
Trade-offs
| Words | Pros | Cons |
|---|---|---|
| 5 words | Clean, focused, easy to label | May miss nuances |
| 8 words (default) | Good balance of clarity and detail | (none) |
| 10-12 words | Richer context, better for unfamiliar domains | Can include lower-relevance terms |
Recommendation
Start with 8 words (default). Increase to 10-12 if you're exploring unfamiliar data and need more context to understand themes. Decrease to 5 for clean presentation in reports.
Sample Comments by Theme
Most representative documents for each theme (highest confidence scores).
Understanding Sample Comments & Confidence Scores
What Are Sample Comments?
These are the most representative documents for each theme: the ones the algorithm is most confident about assigning to that theme. They serve as concrete examples of what each theme "looks like" in your data.
Understanding Confidence Scores
The confidence percentage reflects how strongly a document matches its assigned theme.
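One plausible way to derive such scores from NMF's document-theme matrix W is to take each document's weight on its assigned theme as a share of its total weight. A sketch with hypothetical weights; the tool's exact normalization may differ:

```python
import numpy as np

# Hypothetical document-theme matrix W from NMF (4 docs x 2 themes).
W = np.array([[0.80, 0.05],
              [0.10, 0.70],
              [0.30, 0.25],
              [0.00, 0.90]])

# Confidence: the assigned theme's share of a document's total weight.
shares = W / W.sum(axis=1, keepdims=True)
assigned = W.argmax(axis=1)
confidence = shares[np.arange(len(W)), assigned]

# Most representative documents for theme 1, highest confidence first.
theme1_docs = [i for i in confidence.argsort()[::-1] if assigned[i] == 1]
```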
How to Use Samples
When Samples Don't Match Expectations
If sample comments seem wrong for a theme: