Topic Model Labelling with LLMs

By: Petr Koráb*, Martin Feldkircher**,***, Viktoriya Teliha** (*Text Mining Stories, Prague, **Vienna School of International Studies, ***Centre for Applied Macroeconomic Analysis, Australia).

Labelling the lists of terms produced by topic models requires domain experience and may be subjective to the labeler. Especially when the number of topics grows large, it is convenient to assign human-readable names to topics automatically with an LLM. Simply copying and pasting the results into UIs such as chatgpt.com is a “black-box” and unsystematic approach. A better choice is to add topic labeling to the code with a documented labeler, which gives the engineer more control over the results and ensures reproducibility. This tutorial will explore in detail:

  • How to train a topic model with the new Turftopic Python package
  • How to label topic model results with GPT-4o mini.

We will train the cutting-edge FASTopic model by Xiaobao Wu et al. [3], presented at last year’s NeurIPS. This model outperforms competing models such as BERTopic on several key metrics (e.g., topic diversity) and has broad applications in business intelligence.

1. Components of the Topic Modelling Pipeline

Labelling is an essential part of the topic modelling pipeline because it bridges the model outputs with real-world decisions. The model assigns a number to each topic, but a business decision relies on a human-readable text label summarizing the typical terms in each topic. Topics are typically labelled by (1) labellers with domain experience, often following a well-defined labelling strategy, (2) LLMs, and (3) commercial tools. The path from raw data to decision-making through a topic model is illustrated in Image 1.

Image 1. Components of the topic modeling pipeline.
Source: adapted and extended from Kardos et al. [2].

The pipeline starts with raw data, which is preprocessed and vectorized for the topic model. The model returns topics identified by integers, each with its typical terms (words or bigrams). The labeling layer replaces the integer in the topic name with a text label. The model user (product manager, customer care department, etc.) then works with the labelled topics to make data-informed decisions. The following modeling example walks through this pipeline step by step.

2. Data

We will use FASTopic to classify customer complaints into 10 topics. The example uses a synthetically generated Customer Care Email dataset available on Kaggle, licensed under GPL-3. The prefiltered data covers 692 incoming emails to the customer care department and looks like this:

Image 2. Customer Care Email dataset. Image by authors.

2.1. Data preprocessing

Text data is sequentially preprocessed in six steps. Numbers are removed first, followed by emojis. English stopwords are removed afterward, followed by punctuation. Additional tokens (such as company and person names) are removed in the next step before lemmatization. Read more on text preprocessing for topic models in our previous tutorial.
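
For illustration, here is a minimal sketch of such a cleaning function, assuming spaCy and the emoji package are installed; the EXTRA_TOKENS set of company and person names is a hypothetical placeholder for your own domain-specific list:

import re
import emoji   # pip install emoji
import spacy   # pip install spacy; python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

# Hypothetical extra tokens (company and person names) to drop;
# replace with your own domain-specific list
EXTRA_TOKENS = {"acme", "smith"}

def clean_text(text: str) -> str:
    text = re.sub(r"\d+", " ", text)               # 1. remove numbers
    text = emoji.replace_emoji(text, replace=" ")  # 2. remove emojis
    doc = nlp(text.lower())
    tokens = [
        tok.lemma_                                 # 6. lemmatize
        for tok in doc
        if not tok.is_stop                         # 3. drop English stopwords
        and not tok.is_punct                       # 4. drop punctuation
        and tok.lemma_ not in EXTRA_TOKENS         # 5. drop extra tokens
    ]
    return " ".join(tokens)

print(clean_text("Order #123 from Acme never arrived, please help!"))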

First, we read the clean data and create the corpus as a list of documents:

import pandas as pd

# Read data
data = pd.read_csv("data.csv", usecols=['message_clean'])

# Create corpus list
docs = data["message_clean"].tolist()
Image 3. Recommended cleaning pipeline for topic models. Image by authors.

2.2. Bigram vectorization

Next, we create a bigram vectorizer so that tokens are processed as bigrams during model training. Bigram models carry more relevant information and identify key qualities and problems for business decisions better than single-word models (“delivery” vs. “poor delivery”, “stomach” vs. “sensitive stomach”, etc.).

from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(
    ngram_range=(2, 2),               # only bigrams
    max_features=1000                 # top 1000 bigrams by frequency
)
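
As a quick sanity check, we can preview the most frequent bigrams in the corpus. This is illustration only, so the sketch below fits a cloned copy of the vectorizer and leaves the original unfitted for FASTopic:

from sklearn.base import clone

# Fit a throwaway copy on the corpus and inspect the bigram vocabulary
preview_vectorizer = clone(bigram_vectorizer)
X = preview_vectorizer.fit_transform(docs)
print(preview_vectorizer.get_feature_names_out()[:10])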

3. Model training

The FASTopic model is currently implemented in two Python packages:

  • Fastopic: official package by X. Wu
  • Turftopic: new Python package that brings many helpful topic modeling features, including labeling with LLMs [2]

We will use the Turftopic implementation because it links the model directly to the Namer, which provides LLM labelling.

Let’s set up the model and fit it to the data. It is essential to set a random state to make training reproducible.

from turftopic import FASTopic

# Model specification
topic_size  = 10
model = FASTopic(n_components = topic_size,       # train for 10 topics
                 vectorizer = bigram_vectorizer,  # generate bigrams in topics
                 random_state = 32).fit(docs)     # set random state 

# Extract topic data from the fitted model
topic_data = model.prepare_topic_data(docs)

Now, let’s prepare a dataframe with topic IDs and the top 10 bigrams with the highest probabilities from the model (code is here).
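
One way to build this table is sketched below, assuming Turftopic’s topics_df() helper (used again in the labeling step) returns one row per topic with a comma-separated topic_words column:

# Sketch: table of topic IDs and their top 10 bigrams
unlabeled_df = model.topics_df()
unlabeled_df.columns = ['topic_id', 'topic_name', 'topic_words']

# keep only the first 10 bigrams per topic
unlabeled_df['topic_words'] = (
    unlabeled_df['topic_words']
    .str.split(',')
    .apply(lambda words: [w.strip() for w in words[:10]])
)
print(unlabeled_df.head())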

Image 4. Unlabeled topics in FASTopic. Image by authors.

4. Topic labeling

In the next step, we add text labels to the topic IDs with GPT-4o mini. With the code below, we label the topics and add a new column, topic_name, to the dataframe.

from turftopic.namers import OpenAITopicNamer
import os

# OpenAI API key to access GPT-4o mini
os.environ["OPENAI_API_KEY"] = ""   

# use Namer to label topic model with LLM
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# create a dataframe with labelled topics
topics_df = model.topics_df()
topics_df.columns = ['topic_id', 'topic_name', 'topic_words']

# split and explode
topics_df['topic_word'] = topics_df['topic_words'].str.split(',')
topics_df = topics_df.explode('topic_word')
topics_df['topic_word'] = topics_df['topic_word'].str.strip()

# add a rank for each word within a topic
topics_df['word_rank'] = topics_df.groupby('topic_id').cumcount() + 1

# pivot to wide format
wide = topics_df.pivot(index='word_rank', 
                       columns=['topic_id', 'topic_name'], values='topic_word')

Here is the table with labeled topics after the additional transformations. It would be interesting to compare the LLM’s labels with those of a company insider familiar with the company’s processes and customer base. Since the dataset is synthetic, let’s rely on the GPT-4o mini labeling.

Image 5. Labeled topics in FASTopic by GPT-4o mini. Image by authors.

We can also visualize the labeled topics for a better presentation. The code for the bigram word cloud visualization, generated from the topics produced by the model, is here.
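
A sketch of one possible implementation with the wordcloud package, assuming the exploded topics_df from above and topic IDs starting at 0; the rank-based weights are an assumption, since the exploded dataframe no longer carries the model’s probabilities:

import matplotlib.pyplot as plt
from wordcloud import WordCloud   # pip install wordcloud

# Word cloud for one topic; weights derived from word_rank (assumption),
# because the exploded dataframe no longer stores probabilities
topic = topics_df[topics_df['topic_id'] == 0]   # assumes topic IDs start at 0
freqs = {row.topic_word: 11 - row.word_rank     # rank 1 -> weight 10
         for row in topic.itertuples()}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freqs)             # multi-word keys stay intact

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title(topic['topic_name'].iloc[0])
plt.show()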

Image 6. Word cloud visualization of labeled topics by GPT-4o mini. Image by authors.

Summary

  • The new Turftopic Python package links recent topic models with the LLM-based labeler for generating human-readable topic names.
  • The main benefits are: (1) independence from the labeler’s subjective experience, (2) the capacity to label models with a large number of topics that a human labeler could hardly handle alone, and (3) more control over the code and reproducibility.
  • Topic labeling with LLMs has a wide range of applications in diverse areas. Read our latest paper on the topic modeling of central bank communication, where GPT-4 labeled the FASTopic model [1].
  • The labels differ slightly across training runs, even with a fixed random state. This is not caused by the Namer but by random processes in model training, which output bigrams ordered by probability. The probability differences are tiny, so each run promotes a few new terms into the top 10, which in turn changes the input to the LLM labeler.

The data and complete code for this tutorial are here.

Petr Korab is a Senior Data Analyst and Founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.

Sign up for our blog to get the latest news from the NLP industry!

References

[1] Feldkircher, M., Korab, P., Teliha, V. (2025). What Do Central Bankers Talk About? Evidence From the BIS Archive. CAMA Working Papers 2025–35, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.

[2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. Journal of Open Source Software, 10(111), 8183, https://doi.org/10.21105/joss.08183.

[3] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint arXiv:2405.17978.
