Anomaly detection has been a long-standing challenge in the machine learning community.
Whenever a new paradigm comes along, whether it’s deep learning, reinforcement learning, self-supervised learning, or graph neural networks, you’ll almost always see practitioners eager to try it out on anomaly detection problems.
LLMs are, of course, no exception.
In this post, we’ll take a look at some emerging ways people are using LLMs in anomaly detection pipelines:
- Direct anomaly detection
- Data augmentation
- Anomaly explanation
- LLM-based representation learning
- Intelligent detection model selection
- Multi-agent system for autonomous anomaly detection
- (Bonus) Anomaly detection for LLM agentic systems
For each application pattern, we’ll check out concrete examples to see how it’s being applied in practice. Hopefully, this gives you a clearer sense of which pattern might be a good fit for your own challenges.
If you’re new to LLMs & agents, I invite you to walk through a hands-on build in LangGraph 101: Let’s Build a Deep Research Agent.
1. Direct Anomaly Detection
1.1 Concept
The most common approach is to directly use an LLM to analyze the data and detect anomalies. Effectively, we are betting that the extensive pre-trained knowledge of LLMs (plus any knowledge supplied in the prompts) is already good enough to distinguish abnormalities from the normal baseline.
1.2 Case Study
This way of using LLMs is the simplest when the underlying data is in text format. A case in point is the LogPrompt study [1], where the researchers looked at system log anomaly detection in the context of software operations.
The solution is straightforward: an LLM is first configured with a carefully drafted prompt. During inference, given new raw system logs, the LLM outputs an anomaly prediction plus a human-readable explanation.
As you have probably guessed, the critical step in this workflow is prompt engineering. In that work, the authors employed Chain-of-Thought prompting, few-shot in-context learning (with labeled examples), and domain-driven rule constraints, and reported good performance with this hybrid prompting strategy.
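To make this concrete, here is a minimal sketch of what such a hybrid prompt might look like, using the OpenAI Python client. The rules, few-shot examples, and model name below are illustrative assumptions, not the exact prompts from the LogPrompt paper:

```python
# A minimal sketch of a hybrid prompt (CoT + few-shot + domain rules) for
# log anomaly detection, in the spirit of LogPrompt. All prompt wording,
# example logs, and the model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are an expert in software operations.
Classify each system log line as NORMAL or ANOMALOUS.
Domain rules:
- Repeated authentication failures from one host are ANOMALOUS.
- Routine health checks and scheduled jobs are NORMAL.
Think step by step before giving the final label, then explain briefly."""

FEW_SHOT = [
    ("kernel: health check OK (uptime 42d)", "NORMAL"),
    ("sshd: 57 failed password attempts for root from 10.0.0.9", "ANOMALOUS"),
]

def classify_log(line: str) -> str:
    # Render labeled examples for few-shot in-context learning.
    examples = "\n".join(f"Log: {l}\nLabel: {y}" for l, y in FEW_SHOT)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{examples}\nLog: {line}\nLabel:"},
        ],
    )
    return response.choices[0].message.content

print(classify_log("disk-monitor: inode table 98% full on /dev/sda1"))
```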
For data modalities beyond text, another interesting study worth mentioning is SIGLLM [2], a zero-shot anomaly detector for time series.
A key problem addressed in the work is the conversion of time-series data to text. To achieve that goal, the authors proposed a pipeline that consists of a scaling step, a quantization step, a rolling window creation step, and finally, a tokenization step. Once the LLM can properly understand time-series data, it can be used to perform anomaly detection either through direct prompting, or through forecasting, i.e., using discrepancies between predicted and actual values to flag anomalies.
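To get a feel for the conversion pipeline, here is a rough sketch of the scale → quantize → window → stringify steps. The parameter values (quantization levels, window size, stride) are illustrative assumptions, not the paper’s settings:

```python
# A rough sketch of SIGLLM-style time-series-to-text conversion:
# scale -> quantize to integers -> rolling windows -> string encoding.
import numpy as np

def series_to_text_windows(values, n_levels=1000, window=140, step=70):
    v = np.asarray(values, dtype=float)
    # Scale to [0, 1], then quantize so each value becomes a short
    # digit string that an LLM tokenizer can handle.
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)
    q = np.round(v * (n_levels - 1)).astype(int)
    # Slice into overlapping rolling windows; render each as CSV text.
    windows = []
    for start in range(0, len(q) - window + 1, step):
        chunk = q[start:start + window]
        windows.append(",".join(str(x) for x in chunk))
    return windows

ts = np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.05, 500)
for text in series_to_text_windows(ts)[:2]:
    print(text[:80], "...")
```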
1.3 Practical Considerations
This direct anomaly detection pattern stands out largely due to its simplicity, as LLMs are mainly treated as a standard, one-round input-output chatbot. Once you figure out how to convert your domain data into text and craft an effective prompt, you are good to go.
However, we should keep in mind that the implicit assumption made by this application pattern is that the LLM’s pre-trained knowledge (possibly augmented by a prompt) is sufficient for differentiating what is normal and what is abnormal. This might not hold for niche domains.
On top of that, this application pattern also faces challenges in defining “normal” in the first place, information loss during data conversion, limited scalability, and potentially high cost, to name a few.
Overall, we can view it as a good entry point for using LLMs for anomaly detection, especially for text-based data, but keep in mind that it can only take you so far in many cases.
1.4 Resources
[1] Liu et al., Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies, arXiv, 2023.
[2] Alnegheimish et al., Large language models can be zero-shot anomaly detectors for time series?, arXiv, 2024.
2. Data Augmentation
2.1 Concept
A common pain point of doing anomaly detection in practice is the lack of labeled abnormal samples. This cold, hard fact usually blocks practitioners from adopting the more effective supervised learning paradigm.
LLMs are generative models. Therefore, it’s only natural for practitioners to explore their ability to synthesize realistic anomalous samples. This way, we would obtain a more balanced dataset, making supervised anomaly detection a reality.
2.2 Case Study
An example we can learn from is NVIDIA’s Cyber Language Models for synthetic log generation [3].
In their work, the NVIDIA research team trained a GPT-2-sized foundation model specifically on raw cybersecurity logs. Once trained, the model can generate realistic synthetic logs for different purposes, such as user-specific log generation, scenario simulation, and suspicious event generation. This synthetic data can then be incorporated into the next training cycle of the digital fingerprinting pipeline of NVIDIA Morpheus to reduce false positives.
2.3 Practical Considerations
Leveraging LLMs’ generative capability to overcome data scarcity is a cost-effective approach for improving the robustness and generalization of the downstream anomaly detection system. A big plus is that you can easily achieve controllable and targeted generation, i.e., prompting the LLMs to create data with particular characteristics, or target specific blind spots in your existing detection models.
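As an illustration of targeted generation, here is a hedged sketch that prompts a general-purpose LLM for anomalous log lines matching a chosen scenario. Note that the NVIDIA work fine-tuned a dedicated GPT-2-scale model on raw logs; this prompting-only sketch is a simplification of the idea:

```python
# A hedged sketch of controllable synthetic log generation. The scenario
# string targets a specific blind spot of the existing detector; the
# prompt wording and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def generate_anomalous_logs(scenario: str, n: int = 5) -> str:
    prompt = (
        f"Generate {n} realistic syslog-style log lines for this anomalous "
        f"scenario: {scenario}. Vary hosts, timestamps, and process names. "
        "Output one log line per line, no commentary."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # encourage diversity in the synthetic samples
    )
    return response.choices[0].message.content

# Target a known blind spot of the current detection models.
print(generate_anomalous_logs("slow data exfiltration over DNS tunneling"))
```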
However, challenges exist as well. For example, how do we ensure the generated data is truly plausible, representative, and diverse? How do we validate the quality of the synthetic data?
There are still many unknowns to be addressed. Nevertheless, if your problem suffers from a high false positive rate due to the lack of abnormal samples (or the diversity of normal samples), synthetic data generation via LLMs could still be worth a shot.
2.4 Resources
[3] Gorkem Batmaz, Building Cyber Language Models to Unlock New Cybersecurity Capabilities, NVIDIA Blog, 2024.
3. Anomaly Explanation
3.1 Concept
In practice, simply flagging anomalies is rarely enough. Practitioners often need to understand the “why” to determine the best next step. Traditional anomaly detection methods generally stop at producing a binary yes/no label. The gap between the “prediction” and the “action” can be potentially bridged by LLMs, thanks to their extensive, pre-trained knowledge and their language understanding & generating capabilities.
3.2 Case Study
An interesting example is the work in [4], where the authors explored using LLMs (GPT-4 & LLaMA3) to provide explainable anomaly detection for time series data.
Compared to the SIGLLM work we discussed earlier, this work goes one step further: it not only identifies anomalies but also generates natural language explanations for why specific points or patterns are considered abnormal. For example, when detecting a shape anomaly in a cyclical pattern, the system might explain: “There are anomalies at indices 17, 18, and 19. Here, the values unexpectedly plateau at 4, which does not align with the previous cycles observed, where after hitting the peak value, a decrease follows. This anomaly can be flagged as it interrupts the established cyclical pattern of peaks and troughs.”
However, the work also revealed that explanation quality varies significantly by anomaly type: point anomalies generally lead to higher-quality explanations, while context-aware anomalies, such as shape anomalies or seasonal/trend anomalies, are more challenging to explain accurately.
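To illustrate the pattern, here is a minimal sketch of asking an LLM to explain an already-flagged anomaly. The prompt format and model are assumptions for illustration, not the paper’s exact setup:

```python
# A minimal sketch of pattern #3: an LLM explains an anomaly that a
# detector has already flagged. Prompt format is an assumption.
from openai import OpenAI

client = OpenAI()

def explain_anomaly(series: list[int], flagged: list[int]) -> str:
    prompt = (
        f"Time series: {series}\n"
        f"A detector flagged indices {flagged} as anomalous.\n"
        "Explain in 2-3 sentences why these points deviate from the "
        "pattern established by the rest of the series."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# A cyclical series that plateaus at its peak instead of decreasing,
# mirroring the shape-anomaly example quoted above.
series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 4, 4, 4]
print(explain_anomaly(series, [17, 18, 19]))
```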
3.3 Practical Considerations
This “anomaly explanation” pattern works best when you need to understand the reasoning for guiding the subsequent action. It could also come in handy when you are not satisfied with simple statistical explanations that might fail to capture complex data patterns.
However, guard against hallucination. At the current stage, we still see LLMs generate plausible-sounding but actually incorrect statements. This could also apply to anomaly explanation.
3.4 Resources
[4] Dong et al., Can LLMs Serve As Time Series Anomaly Detectors?, arXiv, 2024.
If you are also interested in analytical explainable AI techniques, please feel free to check out my blog: Explainable Anomaly Detection with RuleFit: An Intuitive Guide.
4. LLM-based Representation Learning
4.1 Concept
Generally, we can think of an ML-based anomaly detection task as consisting of the following three steps:
Feature engineering → Anomaly detection → Anomaly explanation
If LLMs can be applied to the anomaly detection step (pattern #1) and the anomaly explanation step (pattern #3), there is no reason they cannot also be applied to the first step, i.e., feature engineering.
Specifically, this application pattern treats LLMs as feature transformers that convert raw data into a new semantic latent space, which better describes complex patterns and relationships in data. Then, traditional anomaly detection algorithms can take those transformed features as inputs and hopefully, produce superior detection performance.
4.2 Case Study
A representative case study is given in one of Databricks’ technical blogs [5], which is about detecting fraudulent purchases.
In the work, LLMs are first used to compute the embeddings of the purchase data. Then, a traditional anomaly detection algorithm (e.g., PCA, or clustering-based approaches) is used to score the abnormality of the embedding vectors. Anomaly flags are raised for items whose anomaly score is higher than a pre-defined threshold.
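Here is a hedged sketch of the embed-then-score idea, using sentence-transformers for the embeddings and PCA reconstruction error as the anomaly score. The model, sample data, and threshold are illustrative assumptions; the Databricks blog uses its own stack:

```python
# A hedged sketch of pattern #4: embed raw records with a language model,
# then score anomalies with PCA reconstruction error.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

purchases = [
    "groceries | milk | $3 | in-store purchase",
    "groceries | bread | $2 | in-store purchase",
    "groceries | eggs | $4 | in-store purchase",
    "electronics | laptop | $1,200 | shipped to registered address",
    "groceries | cheese | $6 | in-store purchase",
    "electronics | 40x gift cards | $4,000 | shipped to new address",
]

# 1. Embed raw records into a semantic latent space.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(purchases)

# 2. Fit PCA and use reconstruction error as the anomaly score.
pca = PCA(n_components=2).fit(emb)
recon = pca.inverse_transform(pca.transform(emb))
scores = np.linalg.norm(emb - recon, axis=1)

# 3. Flag items whose score exceeds a pre-defined threshold.
threshold = scores.mean() + 2 * scores.std()
for text, s in zip(purchases, scores):
    label = "ANOMALY" if s > threshold else "ok"
    print(f"{label:7s} score={s:.3f} {text}")
```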
What’s also interesting about this work is that a hybrid approach is proposed: the identified anomalies via embeddings + PCA are further analyzed by an LLM to obtain deeper contextual understanding and explanations, i.e., clarify why a particular product is flagged anomalous. Effectively, it combines both pattern #3 and the current pattern to deliver a comprehensive anomaly detection solution. As the authors pointed out in the blog, this hybrid approach maintains accuracy and interpretability while keeping costs lower and making the solution more scalable.
4.3 Practical Considerations
Using LLMs to transform raw data is a powerful approach that can effectively capture deep semantic meaning and context. This paves the way for employing classic anomaly detection algorithms, while still being able to reach high performance.
Nevertheless, we should also keep in mind that the embedding produced by LLMs is a high-dimensional, opaque vector, which could make it hard to explain the root cause of a detected anomaly.
Also, the quality of the representation is entirely dependent on the knowledge baked into the pre-trained LLM. If your data is highly domain-specific, the resulting embeddings may not be meaningful. As a consequence, the anomaly detection performance might be poor.
Finally, generating embeddings is not free. In fact, you are running a forward pass through a very large neural network, which is significantly more computationally expensive and introduces more latency than traditional feature engineering methods. This can be a major issue for real-time detection systems.
4.4 Resources
[5] Kyra Wulffert, Anomaly detection using embeddings and GenAI, Databricks Technical Blog, 2024.
5. Intelligent Detection Model Selection
5.1 Concept
When building an anomaly detection solution in practice, one big headache—for both beginners and experienced practitioners—is picking the right model. With so many algorithms out there, it’s not always clear which one will work best for your dataset. Traditionally, this is pretty much an expert-knowledge-driven, trial-and-error process.
LLMs, thanks to their extensive pre-training, have likely accumulated considerable knowledge about the theory behind various anomaly detection algorithms, and about which algorithms are best suited to which kinds of problems and data characteristics.
Therefore, it is only natural to capitalize on this pre-trained knowledge, as well as the reasoning capabilities, of the LLMs to automate the model recommendation process.
5.2 Case Study
In the new release of the PyOD 2 library [6] (the go-to library for detecting anomalies/outliers in multivariate data), the developers introduced LLM-driven model selection for anomaly/outlier detection.
This recommendation system operates through a three-step process:
- Model Profiling – analyzing each algorithm’s research papers and source code to extract symbolic metadata describing strengths (e.g., “effective in high-dimensional data”) and weaknesses (e.g., “computationally heavy”).
- Dataset Profiling – computing statistical characteristics like dimensionality, skewness, and noise levels, then using LLMs to convert these metrics into standardized symbolic tags.
- Intelligent Selection – applying symbolic matching followed by LLM-based reasoning to evaluate trade-offs among candidate models and select the most suitable option.
This way, the model recommendation system is able to make its choices transparent and easy to understand. Also, it is flexible enough to easily adapt when new models are introduced.
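To make the three-step process concrete, here is a much-simplified sketch of the symbolic-matching idea using two standard PyOD detectors. The tags, thresholds, and matching logic are illustrative assumptions; the actual PyOD 2 implementation extracts model metadata and reasons over candidates with an LLM:

```python
# A simplified sketch of symbolic matching for detector selection.
import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN

# Step 1 (model profiling): symbolic strengths per detector (assumed tags).
MODEL_TAGS = {
    IForest: {"high_dimensional", "large_sample"},
    KNN: {"low_dimensional", "small_sample"},
}

# Step 2 (dataset profiling): compute statistics, map to symbolic tags.
def profile_dataset(X: np.ndarray) -> set[str]:
    n, d = X.shape
    tags = {"high_dimensional" if d > 50 else "low_dimensional"}
    tags.add("large_sample" if n > 10_000 else "small_sample")
    return tags

# Step 3 (selection): pick the detector whose tags best match the data.
X = np.random.randn(500, 8)
data_tags = profile_dataset(X)
best = max(MODEL_TAGS, key=lambda m: len(MODEL_TAGS[m] & data_tags))

clf = best().fit(X)
print(best.__name__, "selected; top anomaly score:", clf.decision_scores_.max())
```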
5.3 Practical Considerations
Treating LLMs as “AI judges” is already a trendy topic in the broader AutoML field, as it holds quite some promise in addressing the scalability of expert knowledge. This could be especially helpful for junior practitioners who may lack deep expertise in statistics, machine learning, or the specific data domain.
Another advantage of this application pattern is that it helps codify and standardize best practices. We can easily integrate a team/organization’s internal best practices into the LLMs’ prompt. This way, we can ensure that the solutions being developed are not just effective but also consistent, maintainable, and compliant.
However, we should always stay alert to hallucinated recommendations and justifications that LLMs might produce. Never blindly trust the results; always verify the LLMs’ reasoning traces.
Also, the field of anomaly detection is constantly evolving, with new algorithms and techniques popping up regularly. This means LLMs might operate on an outdated knowledge base, suggesting older, less-effective methods instead of the newer model that is perfectly suited for the problem. RAG is critical here to keep LLMs’ knowledge current and ensure the effectiveness & relevance of the proposed suggestions.
5.4 Resources
[6] Chen et al., PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection, arXiv, 2024.
6. Multi-Agent System for Autonomous Anomaly Detection
6.1 Concept
A multi-agent system (MAS) refers to a system where multiple specialized agents (powered by LLMs) collaborate to achieve a pre-defined goal. The agents are usually specialized by task or by skill (e.g., with specific document access/retrieval capabilities or tools to call). This is one of the fastest-growing fields in LLM applications, and practitioners are also looking into how this new toolkit can be used to drive end-to-end autonomous anomaly detection.
For a hands-on agent graph you can adapt for anomaly triage and rule synthesis, see LangGraph 101.
6.2 Case Study
For this application pattern, let’s take a look at the Argos system [7]: An agentic system for time-series anomaly detection in the cloud infrastructure powered by LLMs.
The developed system relies on reproducible and explainable detection rules to flag anomalies in time-series data. As a result, the core of the system is to ensure the robust generation of those detection rules.
To achieve that goal, the developers composed a three-agent collaborative pipeline:
- Detection Agent, which generates Python-based anomaly detection rules by analyzing time-series data patterns and implementing them as executable code.
- Repair Agent, which checks proposed rules for syntax errors by executing them on dummy data, and provides error messages and corrections until all syntax issues are resolved.
- Review Agent, which evaluates rule accuracy on validation data, compares performance with previous iterations, and provides feedback for improvement.
Note that those agents do not work in a simple linear fashion, but rather form an iterative loop that continues to improve rule accuracy, as sketched below. For example, if the Review Agent detects any issues, the rules are sent back to the Repair Agent for fixing; otherwise, the feedback is fed back to the Detection Agent to incorporate new rules.
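Here is a high-level sketch of that iterative loop. The agent functions are hypothetical placeholders; in the real system, each one is an LLM call that writes, repairs, or reviews executable detection rules:

```python
# A high-level sketch of the Argos-style three-agent loop.
# Each function below stands in for an LLM-powered agent.
def detection_agent(feedback: str | None) -> str:
    """Generate (or revise) a Python detection rule from data patterns."""
    return "def rule(ts): return [i for i, v in enumerate(ts) if v > 10]"

def repair_agent(rule: str) -> str:
    """Execute the rule on dummy data; fix syntax errors until it runs."""
    compile(rule, "<rule>", "exec")  # raises SyntaxError if broken
    return rule

def review_agent(rule: str) -> tuple[bool, str]:
    """Score the rule on validation data; return (passed, feedback)."""
    return True, "F1 improved over previous iteration"

rule, feedback = None, None
for _ in range(5):  # iterate until the review passes
    rule = detection_agent(feedback)
    rule = repair_agent(rule)
    passed, feedback = review_agent(rule)
    if passed:
        break
print("accepted rule:", rule)
```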
Another interesting design pattern presented in this work is the fusion of LLM-generated rules with existing anomaly detectors that have been well-tuned over time in production. This pattern enjoys the best of both worlds: analytical AI and generative AI.
6.3 Practical Considerations
The multi-agent system is an advanced application pattern for integrating LLMs into the anomaly detection pipeline. The core benefits include specialization and division of labor, where each agent can be equipped with highly specialized instructions, tools, and context, as well as the possibility of achieving truly autonomous end-to-end problem-solving.
On the other hand, however, this application pattern inherits all the pain points of multi-agent systems: significantly increased complexity in design, implementation, and maintenance; cascading errors and miscommunication; and high cost and latency, which can make large-scale or real-time applications infeasible.
6.4 Resources
[7] Gu et al., Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models, arXiv, 2025.
7. Anomaly Detection for LLM Agentic Systems
7.1 Concept
As a bonus section, let’s discuss another emerging pattern that combines LLMs with anomaly detection. This time, we turn the tables: instead of applying LLMs to assist anomaly detection, let’s explore how anomaly detection strategies can be used to monitor the behavior of LLM systems.
As we briefly mentioned in the previous section, the adoption of multi-agent systems (MAS) is becoming mainstream. What comes with it are the new security and reliability challenges.
Now, if we view a MAS from a high level, we can treat it as just another complex industrial system that takes some inputs, generates some outputs, and emits telemetry data along the way. In that case, why not employ anomaly detection approaches to detect abnormal behaviors of the MAS?
7.2 Case Study
For this application pattern, let’s take a look at a recent work called SentinelAgent [8], a graph-based anomaly detection system designed to monitor LLM-based MASs.
Any system monitoring solution should address two key questions:
- How to extract meaningful, analyzable features from the system?
- How to act on this feature data for anomaly detection?
For the first question, SentinelAgent addresses it by modeling the agent interactions as dynamic execution graphs, where nodes are agents or tools, while edges represent interactions (messages and invocations). This way, the heterogeneous, unstructured outputs of multi-agent systems are transformed into a clean, analyzable graph representation.
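A hedged sketch of this graph representation, using networkx, is shown below. The node/edge attributes and the authorization check are illustrative assumptions; SentinelAgent’s actual schema is richer and built from OpenTelemetry traces:

```python
# A hedged sketch of an execution graph for a multi-agent system:
# nodes are agents/tools, edges are messages and tool invocations.
import networkx as nx

G = nx.MultiDiGraph()

# Nodes are agents or tools.
G.add_node("planner", kind="agent")
G.add_node("coder", kind="agent")
G.add_node("web_search", kind="tool")

# Edges are interactions, annotated with metadata.
G.add_edge("planner", "coder", type="message", content_len=812, ts=1.0)
G.add_edge("coder", "web_search", type="invocation", args_hash="9f2c", ts=1.4)
G.add_edge("web_search", "coder", type="result", status="ok", ts=2.1)

# Structural checks become simple on the graph representation, e.g.,
# flag tool calls from agents not authorized to make them.
ALLOWED_TOOLS = {"planner": set(), "coder": {"web_search"}}
for src, dst, data in G.edges(data=True):
    if data["type"] == "invocation" and dst not in ALLOWED_TOOLS.get(src, set()):
        print(f"ALERT: unauthorized tool call {src} -> {dst}")
```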
For data collection, SentinelAgent uses OpenTelemetry [9] (a standard observability framework) to intercept runtime events with minimal overhead. In addition, the Phoenix platform [10] is used for event monitoring, collecting execution traces of agent systems in near real time.
For the second question, SentinelAgent combines rule-based classification with LLM-based semantic reasoning (pattern #1) to analyze behavior in the collected telemetry data. This enables detection across multiple granularities, from individual agent misbehavior to complex multi-agent attack patterns.
The solution was validated on two case studies: an email assistant system and Microsoft’s Magentic-One generalist system. The authors showed that SentinelAgent successfully detected sophisticated attacks, including prompt injection propagation, unauthorized tool usage, and multi-agent collusion scenarios.
7.3 Practical Considerations
As LLM-based MASs become increasingly deployed in production environments, this application pattern of applying anomaly detection to MAS will only become more important.
However, the current approach of using LLMs as behavioral judges introduces a significant scalability challenge. We are essentially using another LLM-based system to monitor the target MAS. The cost and latency can be serious concerns, especially when monitoring systems with high message throughput or complex execution patterns.
Ironically, the monitoring system itself (SentinelAgent) can be a potential attack target. Since it relies on LLM-based reasoning for semantic analysis, it inherits the same vulnerabilities it aims to detect (think of prompt injection, hallucination, or adversarial manipulation). An attacker who compromises the monitoring system could potentially blind the organization to ongoing attacks or create false alerts that mask real threats.
One way out could be developing standardized telemetry formats and methods to engineer numerical features from multi-agent system interactions. This way, we would be able to leverage conventional, well-established anomaly detection algorithms, which provide more scalable and cost-effective monitoring solutions, while also reducing the attack surface of the monitoring system itself.
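As a sketch of what that could look like, here is a hedged example that summarizes each execution trace as a small numeric feature vector and scores it with a classic detector. The feature choices and values are assumptions for illustration:

```python
# A hedged sketch of the numeric-feature alternative: describe each MAS
# run with a few engineered features, then apply a classic detector.
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per execution trace:
# [n_messages, n_tool_calls, max_loop_depth, total_tokens]
traces = np.array([
    [12, 3, 1, 4_200],
    [10, 2, 1, 3_900],
    [14, 4, 2, 5_100],
    [11, 3, 1, 4_000],
    [96, 41, 9, 88_000],  # runaway loop / possible collusion pattern
])

detector = IsolationForest(contamination=0.2, random_state=0).fit(traces)
labels = detector.predict(traces)  # -1 = anomalous, 1 = normal
for row, label in zip(traces, labels):
    print("ANOMALY" if label == -1 else "ok", row.tolist())
```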
7.4 Resources
[8] He et al., SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems, arXiv, 2025.
[9] OpenTelemetry Documentation.
[10] Arize AI, Phoenix Documentation.
8. Conclusion
Now we have covered the most prominent, emerging patterns of applying LLMs to anomaly detection. Looking back, it is not hard to see that LLMs can be applied to every step of a typical anomaly detection workflow, from data augmentation and feature engineering to detection, model selection, and explanation.
On top of that, we also see that the reverse application, i.e., using anomaly detection methods to monitor LLM-based systems themselves, is gaining some serious traction, creating a bidirectional relationship between these two domains.
By now, you’ve seen how the versatility of LLMs opens up a whole new toolbox for tackling anomaly detection. Hopefully, this post gives you some inspiration to experiment, adapt, and push the boundaries in your own anomaly detection workflows.