Theme Extractor
Automatically discover hidden themes and topics in customer reviews, survey responses, and open-ended feedback using machine learning.
HOW IT WORKS
This tool uses Non-negative Matrix Factorization (NMF) to identify latent themes in text data. Unlike manual coding, the algorithm automatically discovers patterns across hundreds of documents.
Text Preprocessing
Documents are cleaned, tokenized, and lemmatized. Stop words and common filler terms are removed.
TF-IDF Vectorization
Text is converted to numerical vectors based on term frequency and document importance.
Theme Discovery
NMF decomposes the matrix into themes, each represented by its most characteristic words.
Document Assignment
Each document is assigned to themes based on how strongly it matches each theme's pattern.
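The four steps above can be sketched with scikit-learn. This is a minimal illustration on toy reviews; the tool's own pipeline (lemmatization, filler-term filtering, auto-detected theme count) may differ in details:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "battery life is great and it charges fast",
    "battery drains quickly, poor battery life",
    "fast shipping, arrived in two days",
    "shipping was slow and the box was damaged",
]

# Steps 1-2: tokenize, drop stop words, build the TF-IDF matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # documents x terms

# Step 3: factor into document-theme (W) and theme-term (H) matrices.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

# Step 4: assign each document to its strongest theme.
assignments = W.argmax(axis=1)
```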
When to Use This Tool
- Customer Reviews: Discover what aspects of your product customers discuss most
- Survey Responses: Find patterns in open-ended feedback without manual coding
- Social Media: Identify conversation themes around your brand
- Support Tickets: Categorize issues automatically for better triage
Understanding NMF Topic Modeling
What is NMF?
Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique that decomposes your document-term matrix into two smaller matrices:
- W (Document-Theme matrix): Shows how much each document relates to each theme
- H (Theme-Term matrix): Shows which words define each theme
The "non-negative" constraint means all values are ≥ 0, making results interpretable as additive parts that combine to form documents.
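A toy numpy example (hand-picked numbers, purely illustrative rather than real NMF output) shows how the two non-negative factors combine additively:

```python
import numpy as np

# Hypothetical factors over a 5-word vocabulary,
# e.g. ["battery", "charge", "life", "shipping", "box"].
H = np.array([[0.9, 0.8, 0.6, 0.0, 0.1],   # theme 0: battery words
              [0.0, 0.1, 0.2, 0.9, 0.7]])  # theme 1: shipping words
W = np.array([[1.0, 0.0],    # doc 0: purely theme 0
              [0.2, 0.8],    # doc 1: mostly theme 1
              [0.5, 0.5]])   # doc 2: an even mix

X = W @ H   # reconstructed document-term matrix

# Each document row is an additive, non-negative blend of theme rows:
assert np.allclose(X[1], 0.2 * H[0] + 0.8 * H[1])
```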
NMF vs. Other Methods
| Method | Strengths | Best For |
|---|---|---|
| NMF | Interpretable themes, fast, produces sparse topics | Marketing text, customer reviews, product feedback |
| LDA | Probabilistic, handles document length well | Large document collections, academic papers |
| LSA/SVD | Captures synonymy, mathematical foundations | Information retrieval, search engines |
Limitations to Keep in Mind
- Bag of words: Word order is ignored; "not good" and "good" are treated similarly
- Pre-set theme count: You must specify (or auto-detect) the number of themes
- Static themes: Themes don't evolve over time or account for sequence
- Interpretability depends on data: Noisy or very diverse text may produce unclear themes
Tips for Better Results
- Homogeneous data: Analyze reviews of similar products together, not mixed categories
- Sufficient volume: 50+ documents recommended for meaningful themes
- Similar length: Very short tweets mixed with long reviews may skew results
- Domain-specific stop words: Common domain terms (e.g., "product," "service") may need filtering
Understanding TF-IDF Vectorization
What is TF-IDF?
Term Frequency-Inverse Document Frequency measures how important a word is to a document in a collection:
- TF (Term Frequency): How often a word appears in a document (more = more important)
- IDF (Inverse Document Frequency): How rare a word is across all documents (rarer = more distinctive)
Formula: TF-IDF = TF Γ log(Total Documents / Documents with Term)
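The formula can be computed directly in plain Python (toy documents and raw counts; production vectorizers typically add smoothing and length normalization on top):

```python
import math

docs = [
    ["good", "battery", "life"],
    ["battery", "died", "fast"],
    ["fast", "shipping", "today"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / df)            # inverse document frequency
    return tf * idf

# "shipping" (1 of 3 docs) scores higher than "battery" (2 of 3 docs):
assert tf_idf("shipping", docs[2], docs) > tf_idf("battery", docs[1], docs)
```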
Why TF-IDF Matters
Common words like "the," "is," "product" appear everywhere and don't help distinguish themes. TF-IDF automatically downweights these while highlighting distinctive vocabulary:
- High TF-IDF: Words that are frequent in specific documents but rare overall → good theme indicators
- Low TF-IDF: Words that appear in nearly every document, or almost never → weak theme indicators
Example
In electronics reviews:
- "battery" might have high TF-IDF in power-related reviews β strong theme signal
- "product" appears in almost every review β low TF-IDF, filtered out
- "phenomenal" rarely appears but is strong when it does β distinctive sentiment indicator
DATA SOURCE
Use a Case Study
Pre-loaded Text Datasets
Select a preset scenario to explore real-world text data with theme extraction. Each scenario includes sample reviews, comments, or survey responses.
Enter Your Data
Max 500 documents; max 5,000 characters each.
Paste your text data below. Each line will be treated as a separate document (review, comment, etc.).
Upload a CSV file with a text column. The tool will auto-detect columns containing text.
Drag & Drop CSV file (.csv, .tsv, .txt, .xls, .xlsx)
ANALYSIS SETTINGS
Number of Themes: Auto-detect finds the optimal number based on your data.
Words per Theme: More words give richer theme descriptions.
Choosing the Right Number of Themes
Auto-Detection Method
When set to "Auto-detect," the algorithm tests different theme counts (2-8) and selects the one with the best coherence score. Coherence measures how semantically similar the top words in each theme are to each other.
- High coherence: Theme words relate to a single concept (e.g., "battery, charge, power, life")
- Low coherence: Theme words are scattered (e.g., "battery, shipping, color, return")
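The search over theme counts can be sketched as follows. This uses a simplified coherence proxy (mean pairwise cosine similarity between the top words' document-occurrence profiles); the tool's actual coherence measure may differ:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "battery life is great, charges fast",
    "poor battery life, battery drains overnight",
    "battery barely lasts a day after charging",
    "shipping was fast and the package arrived early",
    "slow shipping, the package was delayed a week",
    "shipping box arrived damaged and late",
    "screen is bright with vivid colors",
    "screen cracked and the display flickers",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

def coherence(H, X, top_n=4):
    # Proxy: how similarly each theme's top words are distributed
    # across documents (mean pairwise cosine of their term columns).
    scores = []
    for row in H:
        top = row.argsort()[::-1][:top_n]
        sim = cosine_similarity(X[:, top].T)
        scores.append(sim[np.triu_indices(top_n, k=1)].mean())
    return float(np.mean(scores))

# Try candidate theme counts and keep the most coherent one.
best_k, best_score = None, -1.0
for k in range(2, 5):
    nmf = NMF(n_components=k, init="nndsvd", random_state=0)
    nmf.fit(X)
    score = coherence(nmf.components_, X)
    if score > best_score:
        best_k, best_score = k, score
```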
When to Override Auto-Detection
| Situation | Recommended Action |
|---|---|
| Themes seem too broad/general | Increase theme count (more specific topics) |
| Similar themes detected | Decrease theme count (consolidate) |
| You know the domain well | Set count based on expected categories |
| Very small dataset (<100 docs) | Use 2-4 themes to avoid overfitting |
| Large, diverse dataset (>500 docs) | Try 5-8 themes for granularity |
Iterative Approach
Topic modeling often benefits from experimentation:
- Start with auto-detect to get a baseline
- Review the themes: do they make business sense?
- If themes overlap, reduce the count; if themes are too broad, increase it
- Compare sample comments across runs to validate
Words per Theme: Quality vs. Specificity
What This Controls
This setting determines how many top-weighted words are shown for each discovered theme. The words are ranked by their importance to that theme (NMF weight).
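Given NMF's theme-term matrix H, extracting each theme's top words is a simple ranking. A sketch with hypothetical weights:

```python
import numpy as np

# Hypothetical theme-term matrix H (2 themes x 6-word vocabulary).
vocab = ["battery", "charge", "life", "shipping", "box", "delivery"]
H = np.array([[0.9, 0.7, 0.6, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.8, 0.5, 0.7]])

def top_words(H, vocab, n=3):
    # For each theme, rank words by NMF weight and keep the top n.
    return [[vocab[i] for i in row.argsort()[::-1][:n]] for row in H]

labels = top_words(H, vocab)
# labels[0] -> ['battery', 'charge', 'life']
```

Raising `n` simply extends each ranked list with progressively lower-weight words, which is why large values can pull in lower-relevance terms.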
Trade-offs
| Words | Pros | Cons |
|---|---|---|
| 5 words | Clean, focused, easy to label | May miss nuances |
| 8 words (default) | Good balance of clarity and detail | (none) |
| 10-12 words | Richer context, better for unfamiliar domains | Can include lower-relevance terms |
Recommendation
Start with 8 words (default). Increase to 10-12 if you're exploring unfamiliar data and need more context to understand themes. Decrease to 5 for clean presentation in reports.
Sample Comments by Theme
Most representative documents for each theme (highest confidence scores).
Understanding Sample Comments & Confidence Scores
What Are Sample Comments?
These are the most representative documents for each theme: the ones the algorithm is most confident about assigning to that theme. They serve as concrete examples of what each theme "looks like" in your data.
Understanding Confidence Scores
The confidence percentage reflects how strongly a document matches its assigned theme.
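One plausible way to derive such scores from NMF's document-theme matrix W is to take each document's weight on its assigned theme as a share of its total weight. A sketch with hypothetical weights; the tool's exact normalization may differ:

```python
import numpy as np

# Hypothetical document-theme matrix W from NMF (4 docs x 2 themes).
W = np.array([[0.80, 0.05],
              [0.10, 0.70],
              [0.30, 0.25],
              [0.00, 0.90]])

# Confidence: the assigned theme's share of a document's total weight.
shares = W / W.sum(axis=1, keepdims=True)
assigned = W.argmax(axis=1)
confidence = shares[np.arange(len(W)), assigned]

# Most representative documents for theme 1, highest confidence first.
theme1_docs = [i for i in confidence.argsort()[::-1] if assigned[i] == 1]
```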
How to Use Samples
When Samples Don't Match Expectations
If sample comments seem wrong for a theme: