
Fine-Tune Your Topic Modeling Workflow with BERTopic

Topic modeling remains a critical tool in the AI and NLP toolbox. While large language models (LLMs) handle text exceptionally well, extracting high-level topics from massive datasets still requires dedicated topic modeling techniques. A typical workflow includes four core steps: embedding, dimensionality reduction, clustering, and topic representation.

One of the most popular frameworks today is BERTopic, which simplifies each stage with modular components and an intuitive API. In this post, I’ll walk through practical adjustments you can make to improve clustering outcomes and boost interpretability, based on hands-on experiments with the open-source 20 Newsgroups dataset, which is distributed under the Creative Commons Attribution 4.0 International license.

Project Overview

We’ll start with the default settings recommended in BERTopic’s documentation and progressively update specific configurations to highlight their effects. Along the way, I’ll explain the purpose of each module and how to make informed decisions when customizing them.

Dataset Preparation

We start by loading a random sample of 500 news documents from the training split.

import random
from datasets import load_dataset

# Load the 20 Newsgroups dataset and draw a fixed random sample of 500 documents
dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)

Since the data originates from casual Usenet discussions, we apply cleaning steps to strip headers, remove clutter, and preserve only informative sentences.

This preprocessing ensures higher-quality embeddings and a smoother downstream clustering process.

import re

def clean_for_embedding(text, max_sentences=5):
    lines = text.split("\n")
    # Drop quoted reply lines and common Usenet header fields
    lines = [line for line in lines if not line.strip().startswith(">")]
    lines = [
        line for line in lines
        if not re.match(r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)
    ]
    text = " ".join(lines)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)
    # Keep only reasonably long sentences that aren't all-caps shouting
    sentence_split = re.split(r"(?<=[.!?]) +", text)
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])

texts_clean = [clean_for_embedding(text) for text, _ in text_label_500]
labels = [label for _, label in text_label_500]

Initial BERTopic Pipeline

Using BERTopic’s modular design, we configure each component explicitly: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer plus a KeyBERTInspired model for topic representation.

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics, probs = topic_model.fit_transform(texts_clean)
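
Before tuning anything, it helps to eyeball what the default configuration produced. Both calls below are standard BERTopic accessors:

# Overview of discovered topics; Topic -1 is HDBSCAN's outlier bucket
print(topic_model.get_topic_info())

# Top keywords and their c-TF-IDF scores for one topic
print(topic_model.get_topic(0))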

This setup yields only a few broad topics with noisy representations, which highlights the need for fine-tuning to achieve more coherent results.

Original discovered topics (Image generated by author)

Parameter Tuning for Granular Topics

n_neighbors from the UMAP module

UMAP is the dimensionality reduction module that projects the original embeddings into lower-dimensional dense vectors. Adjusting UMAP’s n_neighbors controls how locally or globally the data is interpreted during dimensionality reduction. Lowering this value uncovers finer-grained clusters and improves topic distinctiveness.

umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()
Topics discovered after setting the UMAP’s n_neighbors parameter (Image generated by author)

min_cluster_size and cluster_selection_method from the HDBSCAN module

HDBSCAN is BERTopic’s default clustering module. Modifying HDBSCAN’s min_cluster_size and switching the cluster_selection_method from “eom” to “leaf” further sharpens topic resolution. These settings help uncover smaller, more focused themes and balance the distribution across clusters.

hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

Setting cluster_selection_method to “leaf” and min_cluster_size to 5 increases the number of clusters to 30.

Topics discovered after setting HDBSCAN’s related parameters (Image generated by author)

Controlling Randomness for Reproducibility

UMAP is inherently non-deterministic, meaning it can produce different results on each run unless you explicitly set a fixed random_state. This detail is often omitted in example code, so be sure to include it to ensure reproducibility.

Similarly, if you’re using a third-party embedding API (like OpenAI), be cautious. Some APIs introduce slight variations on repeated calls. For reproducible outputs, cache embeddings and feed them directly into BERTopic.

from bertopic.backend import BaseEmbedder
import numpy as np
class CustomEmbedder(BaseEmbedder):
    """Light-weight wrapper to call NVIDIA's embedding endpoint via OpenAI SDK."""

    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def encode(self, documents):  # type: ignore[override]
        response = self.client.embeddings.create(
            input=documents,
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        embeddings = np.array([embed.embedding for embed in response.data])
        return embeddings
# Both the client (an OpenAI-compatible SDK client) and the model name below
# are placeholders; substitute whichever embedding endpoint you actually use.
embedder = CustomEmbedder(embedding_model="your-embedding-model", client=client)
embeddings = embedder.encode(texts_clean)  # compute once, reuse everywhere
topic_model.embedding_model = embedder
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)
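
If you embed locally instead, a simple on-disk cache achieves the same reproducibility. Here is a minimal sketch, assuming the SentenceTransformer model defined earlier and an arbitrary cache filename:

import os
import numpy as np

CACHE_PATH = "embeddings_20newsgroups.npy"  # arbitrary cache location

if os.path.exists(CACHE_PATH):
    embeddings = np.load(CACHE_PATH)
else:
    # Encode once, then persist so every later run reuses identical vectors
    embeddings = embedding_model.encode(texts_clean, show_progress_bar=True)
    np.save(CACHE_PATH, embeddings)

topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)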

Every domain may require different clustering settings for optimal results. To streamline experimentation, consider defining evaluation criteria and automating the tuning process, as in the sketch below. For this tutorial, we’ll use the configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to “eom”, a combination that strikes a balance between granularity and coherence.
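
As one way to automate that search, here is a minimal sweep that reuses the modules defined earlier; the scoring criteria (topic count and outlier rate) are simple stand-ins for whatever evaluation metric fits your domain:

from itertools import product

results = []
for n_neighbors, min_cluster_size in product([5, 10, 15], [5, 10, 15]):
    umap_m = UMAP(n_neighbors=n_neighbors, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
    hdbscan_m = HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean",
                        cluster_selection_method="eom", prediction_data=True)
    model = BERTopic(embedding_model=embedding_model, umap_model=umap_m,
                     hdbscan_model=hdbscan_m, vectorizer_model=vectorizer_model)
    topics, _ = model.fit_transform(texts_clean)
    n_topics = len(set(topics)) - (1 if -1 in topics else 0)  # exclude outlier bucket
    outlier_rate = topics.count(-1) / len(topics)             # share of unassigned docs
    results.append((n_neighbors, min_cluster_size, n_topics, outlier_rate))

for row in sorted(results, key=lambda r: r[3]):
    print(row)

Each iteration refits the full pipeline, so passing cached embeddings to fit_transform (as shown above) keeps the sweep fast.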

Improving Topic Representations

Representation plays a crucial role in making clusters interpretable. By default, BERTopic generates unigram-based representations, which often lack sufficient context. In the next section, we’ll explore several techniques to enrich these representations and improve topic interpretability.

N-gram Range

In BERTopic, CountVectorizer is the default tool for converting text into bag-of-words representations. Instead of relying on generic unigrams, switch to bigrams or trigrams via the ngram_range argument of CountVectorizer. This simple change adds much-needed context.

Since we are only updating the representation, BERTopic offers the update_topics function to avoid redoing the modeling all over again.

topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
topic_model.get_topic_info()
Topic representations using bigrams (Image generated by author)

Custom Tokenizer

Some bigrams are still hard to interpret, e.g. “486dx 50”, “ac uk”, “dxf doc”. For greater control, implement a custom tokenizer that filters n-grams based on part-of-speech patterns. This removes meaningless combinations and elevates the quality of your topic keywords.

import spacy
from typing import List

class ImprovedTokenizer:
    """Tokenizer that keeps only bigrams matching meaningful POS patterns."""

    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Keep only the most meaningful syntactic bigram patterns
        self.MEANINGFUL_BIGRAMS = {
            ("ADJ", "NOUN"),
            ("NOUN", "NOUN"),
            ("VERB", "NOUN"),
        }

    def __call__(self, text: str) -> List[str]:
        doc = self.nlp(text[:3000])  # truncate long docs for speed
        tokens = [(t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]

        bigrams = []
        for i in range(len(tokens) - 1):
            lemma1, pos1 = tokens[i]
            lemma2, pos2 = tokens[i + 1]
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                bigrams.append(f"{lemma1} {lemma2}")  # lemmatized and lowercased

        return bigrams

topic_model.update_topics(docs=texts_clean, vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()
Topic representations with messy bigrams removed (Image generated by author)

LLM

Finally, you can integrate LLMs to generate coherent titles or summaries for each topic. BERTopic supports OpenAI integration directly or through custom prompting. These LLM-based summaries drastically improve explainability.

import os
import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", delay_in_seconds=5))
topic_model.get_topic_info()

The representations are now all meaningful sentences. 

Topic representations which are LLM-generated sentences (Image generated by author)

You can also write your own function to get the LLM-generated titles, then write them back to the topic model object using the set_topic_labels function. Please refer to the example code snippet below.

import os
import openai
from typing import Dict, List, Tuple
from tqdm import tqdm

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o"
) -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()

    for topic in tqdm(topics, desc="Generating titles"):
        indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
        if not indices:
            continue
        top_doc = docs[indices[0]]

        prompt = f"""You are a helpful summarizer for topic clustering.
Given the following text that represents a topic, generate:
1. A short **title** for the topic (2–6 words)
2. A one or two sentence **summary** of the topic.
Text:
{top_doc}
"""

        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split("\n")
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))

    return topic_repr

topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])
topic_repr_dict = {
    topic: topic_repr.get(topic, ("Topic", ""))[0]  # fall back to a generic label
    for topic in topic_model.get_topic_info()["Topic"]
}
topic_model.set_topic_labels(topic_repr_dict)
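
Once set, the custom labels appear as a CustomName column in get_topic_info(), and most of BERTopic’s visualization methods expose a custom_labels flag to use them, for example:

# Show the LLM-generated titles instead of the default keyword labels
topic_model.visualize_barchart(custom_labels=True)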

Conclusion

This guide outlined actionable strategies to boost topic modeling results using BERTopic. By understanding the role of each module and tuning parameters for your specific domain, you can achieve more focused, stable, and interpretable topics.

Representation matters just as much as clustering. Whether it’s through n-grams, syntactic filtering, or LLMs, investing in better representations makes your topics easier to understand and more useful in practice.

BERTopic also offers advanced modeling techniques beyond the basics covered here. In a future post, we’ll explore those capabilities in depth. Stay tuned!
