Topic modeling remains a critical tool in the AI and NLP toolbox. While large language models (LLMs) handle text exceptionally well, extracting high-level topics from massive datasets still requires dedicated topic modeling techniques. A typical workflow includes four core steps: embedding, dimensionality reduction, clustering, and topic representation.
One of the most popular frameworks today is BERTopic, which simplifies each stage with modular components and an intuitive API. In this post, I’ll walk through practical adjustments you can make to improve clustering outcomes and boost interpretability, based on hands-on experiments using the open-source 20 Newsgroups dataset, which is distributed under the Creative Commons Attribution 4.0 International license.
Project Overview
We’ll start with the default settings recommended in BERTopic’s documentation and progressively update specific configurations to highlight their effects. Along the way, I’ll explain the purpose of each module and how to make informed decisions when customizing them.
Dataset Preparation
We load a sample of 500 news documents.
import random
from datasets import load_dataset
dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)
Since the data originates from casual Usenet discussions, we apply cleaning steps to strip headers, remove clutter, and preserve only informative sentences.
This preprocessing ensures higher-quality embeddings and a smoother downstream clustering process.
import re

def clean_for_embedding(text, max_sentences=5):
    # Drop quoted reply lines and common Usenet header fields.
    lines = text.split("\n")
    lines = [line for line in lines if not line.strip().startswith(">")]
    lines = [
        line for line in lines
        if not re.match(r"^\s*(from|subject|organization|lines|writes|article)\s*:",
                        line, re.IGNORECASE)
    ]
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)
    # Keep only reasonably long sentences that are not all-caps shouting.
    sentence_split = re.split(r'(?<=[.!?]) +', text)
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])
texts_clean = [clean_for_embedding(text) for text, _ in text_label_500]
labels = [label for _, label in text_label_500]
Initial BERTopic Pipeline
Using BERTopic’s modular design, we configure each component: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer plus a KeyBERT-inspired model for topic representation.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")
# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()
# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()
# All steps together
topic_model = BERTopic(
    embedding_model=embedding_model,            # Step 1 - Extract embeddings
    umap_model=umap_model,                      # Step 2 - Reduce dimensionality
    hdbscan_model=hdbscan_model,                # Step 3 - Cluster reduced embeddings
    vectorizer_model=vectorizer_model,          # Step 4 - Tokenize topics
    ctfidf_model=ctfidf_model,                  # Step 5 - Extract topic words
    representation_model=representation_model   # Step 6 - (Optional) Fine-tune topic representations
)
topics, probs = topic_model.fit_transform(texts_clean)
This setup yields only a few broad topics with noisy representations, which highlights the need for fine-tuning to achieve more coherent results.
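To see this first-hand, inspect the fitted model. get_topic_info lists every topic with its size and top words, and get_topic shows the weighted keywords for a single topic.

# -1 is HDBSCAN's outlier bucket; the remaining rows are the discovered topics.
topic_model.get_topic_info()
# Inspect the keyword representation of the first topic.
topic_model.get_topic(0)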
Parameter Tuning for Granular Topics
n_neighbors from the UMAP module
UMAP is the dimensionality reduction module that projects the original embeddings into lower-dimensional dense vectors. Adjusting its n_neighbors parameter controls how locally or globally the data is interpreted during the reduction. Lowering this value uncovers finer-grained clusters and improves topic distinctiveness.
umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

min_cluster_size and cluster_selection_method from the HDBSCAN module
HDBSCAN is BERTopic’s default clustering module. Lowering min_cluster_size and switching the cluster_selection_method from “eom” to “leaf” further sharpens topic resolution. These settings help uncover smaller, more focused themes and balance the distribution across clusters.
hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()
The number of clusters increases to 30 when cluster_selection_method is set to “leaf” and min_cluster_size to 5.
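A quick sanity check on the cluster count after refitting:

# Count discovered topics, excluding the -1 outlier bucket.
topic_info = topic_model.get_topic_info()
print(len(topic_info[topic_info.Topic != -1]))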

Controlling Randomness for Reproducibility
UMAP is inherently non-deterministic, meaning it can produce different results on each run unless you explicitly set a fixed random_state. This detail is often omitted in example code, so be sure to include it to ensure reproducibility.
Similarly, if you’re using a third-party embedding API (like OpenAI), be cautious. Some APIs introduce slight variations on repeated calls. For reproducible outputs, cache embeddings and feed them directly into BERTopic.
from bertopic.backend import BaseEmbedder
import numpy as np

class CustomEmbedder(BaseEmbedder):
    """Light-weight wrapper to call NVIDIA's embedding endpoint via the OpenAI SDK."""
    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def encode(self, documents):  # type: ignore[override]
        response = self.client.embeddings.create(
            input=documents,
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        return np.array([embed.embedding for embed in response.data])

# `client` is an OpenAI-SDK client configured for your embedding provider;
# the base_url, API key, and model name below are provider-specific placeholders.
embedder = CustomEmbedder(embedding_model="<embedding-model-name>", client=client)
topic_model.embedding_model = embedder

# Embed once and pass the cached vectors in explicitly, so repeated runs
# always see identical embeddings.
embeddings = embedder.encode(texts_clean)
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)
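To make the cache survive across sessions, persist the vectors to disk. A minimal sketch, assuming the local SentenceTransformer from Step 1 and an arbitrary cache file name:

import os
import numpy as np

cache_path = "embeddings_20ng.npy"  # hypothetical cache location
if os.path.exists(cache_path):
    embeddings = np.load(cache_path)
else:
    embeddings = embedding_model.encode(texts_clean, show_progress_bar=True)
    np.save(cache_path, embeddings)
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)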
Every dataset and domain may require different clustering settings for optimal results. To streamline experimentation, consider defining evaluation criteria and automating the tuning process, as sketched below. For this tutorial, we’ll use the configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to “eom”, a combination that strikes a balance between granularity and coherence.
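Here is a rough tuning sketch that grids over the two most influential parameters and scores each configuration by its outlier ratio. The grid values and the criterion are illustrative; you could swap in topic coherence or any metric you trust.

from itertools import product

results = []
for n_neighbors, min_cluster_size in product([5, 10, 15], [5, 10, 15]):
    model = BERTopic(
        embedding_model=embedding_model,
        umap_model=UMAP(n_neighbors=n_neighbors, n_components=5,
                        min_dist=0.0, metric="cosine", random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size,
                              metric="euclidean",
                              cluster_selection_method="eom",
                              prediction_data=True),
    )
    topics, _ = model.fit_transform(texts_clean)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)
    outlier_ratio = topics.count(-1) / len(topics)
    results.append((n_neighbors, min_cluster_size, n_topics, outlier_ratio))

# Prefer configurations that leave few documents unassigned.
for row in sorted(results, key=lambda r: r[3]):
    print(row)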
Improving Topic Representations
Representation plays a crucial role in making clusters interpretable. By default, BERTopic generates unigram-based representations, which often lack sufficient context. In the following sections, we’ll explore several techniques to enrich these representations and improve topic interpretability.
N-gram range
In BERTopic, CountVectorizer is the default tool for converting text data into bag-of-words representations. Instead of relying on generic unigrams, switch to bigrams or trigrams using ngram_range in CountVectorizer. This simple change adds much-needed context.
Since we are only updating the representation, BERTopic offers the update_topics function so we can avoid redoing the modeling from scratch.
topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
topic_model.get_topic_info()

Custom Tokenizer
Some bigrams are still hard to interpret, e.g., “486dx 50”, “ac uk”, “dxf doc”. For greater control, implement a custom tokenizer that filters n-grams based on part-of-speech patterns. This removes meaningless combinations and elevates the quality of your topic keywords.
import spacy
from typing import List

class ImprovedTokenizer:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Keep only the most meaningful syntactic bigram patterns.
        self.MEANINGFUL_BIGRAMS = {
            ("ADJ", "NOUN"),
            ("NOUN", "NOUN"),
            ("VERB", "NOUN"),
        }

    def __call__(self, text: str) -> List[str]:
        doc = self.nlp(text[:3000])  # truncate long docs for speed
        tokens = [(t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]
        bigrams = []
        for (lemma1, pos1), (lemma2, pos2) in zip(tokens, tokens[1:]):
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                # Lemmatized and lowercased to normalize surface variation.
                bigrams.append(f"{lemma1} {lemma2}")
        return bigrams
topic_model.update_topics(docs=texts_clean, vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()

LLM
Finally, you can integrate LLMs to generate coherent titles or summaries for each topic. BERTopic supports OpenAI integration directly or through custom prompting. These LLM-based summaries drastically improve explainability.
import os
import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# gpt-4o-mini is a chat model, so chat=True tells BERTopic to use the chat-completions endpoint.
representation_model = OpenAI(client, model="gpt-4o-mini", chat=True, delay_in_seconds=5)
topic_model.update_topics(texts_clean, representation_model=representation_model)
topic_model.get_topic_info()
The representations are now all meaningful sentences.

You can also write your own function for generating LLM-based titles and write them back to the topic model object using the set_topic_labels function. Please refer to the example code snippet below.
import os
import openai
from tqdm import tqdm
from typing import Dict, List, Tuple

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o"
) -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()
    for topic in tqdm(topics, desc="Generating titles"):
        # Use the first document assigned to this topic as its representative.
        indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
        if not indices:
            continue
        top_doc = docs[indices[0]]
        prompt = f"""You are a helpful summarizer for topic clustering.
Given the following text that represents a topic, generate:
1. A short **title** for the topic (2–6 words)
2. A one or two sentence **summary** of the topic.

Text:
{top_doc}
"""
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split("\n")
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))
    return topic_repr

topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])

# Map each topic id to its generated title, falling back to a generic label.
topic_repr_dict = {
    topic: topic_repr.get(topic, ("Topic", ""))[0]
    for topic in topic_model.get_topic_info()["Topic"]
}
topic_model.set_topic_labels(topic_repr_dict)
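Once set, the labels appear in the CustomName column of get_topic_info, and most of BERTopic’s visualizations can display them via the custom_labels flag:

topic_model.get_topic_info()  # LLM-generated titles now show up under CustomName
topic_model.visualize_barchart(custom_labels=True)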
Conclusion
This guide outlined actionable strategies to boost topic modeling results using BERTopic. By understanding the role of each module and tuning parameters for your specific domain, you can achieve more focused, stable, and interpretable topics.
Representation matters just as much as clustering. Whether it’s through n-grams, syntactic filtering, or LLMs, investing in better representations makes your topics easier to understand and more useful in practice.
BERTopic also offers advanced modeling techniques beyond the basics covered here. In a future post, we’ll explore those capabilities in depth. Stay tuned!