Models alone aren’t enough; having a full system stack and great, successful products is the key. Satya Nadella – [1].
of specialized models, vision, language, segmentation, diffusion, and Mixture-of-Experts, often orchestrated together. Today the stack spans several types: LLM, LCM, LAM, MoE, VLM, SLM, MLM, and SAM [2], alongside the rise of agents. Precisely because the stack is this heterogeneous and fast-moving, teams need a practical framework that guarantees: rigorous evaluation (factuality, relevance, drift), built-in safety and compliance (PII, policy, red-teaming), and dependable operations (CI/CD, observability, rollback, cost controls). Without a framework, most models simply mean more risk and less reliability.
In this article, we’ll summarize the main issues and illustrate them with a real application example. This is not meant to be exhaustive but aims to illustrate the key difficulties.
It is well known that a picture is worth a thousand words (and, behind the shine of AI, perhaps even more). To illustrate and emphasize the core challenges in developing systems based on Large Language Models (LLMs), I created the diagram in Figure 1, which outlines a potential approach for managing information buried inside long, tedious contracts. While many claim that AI will revolutionize every aspect of technology (and leave data scientists and engineers without jobs), the reality is that building robust, reproducible, and reliable applications requires a framework of continuous improvement, rigorous evaluation, and systematic validation. Mastering this rapidly evolving landscape is anything but trivial, and it would certainly take more than a single article to highlight all the details.
This diagram (Figure 1) could be a perfect application for a contact center in the insurance sector. If you’ve ever tried to read your own insurance contract, you’ll know it’s often dozens of pages filled with dense legal language (the kind most of us tend to skip). The truth is, even many insurance employees don’t always know the fine details… but let’s keep that between us! 😉 After all, who could memorize the exact coverages, exclusions, and limits across hundreds of products and insurance types? That’s precisely the kind of complexity we aimed to address with this system.
The ultimate goal is to create a tool for the contact center staff, adjusters and fraud investigators, that can instantly answer complex questions such as policy limits or specific coverage conditions in specific situations. But while the application may seem simple on the surface, what I want to highlight here are the deeper challenges any developer faces when maintaining and improving these kinds of systems.
This goes far beyond just building a cool demo. It requires continuous monitoring, validation, bias mitigation, and user-driven refinement. These tasks are often labeled ‘Business-As-Usual’ (BAU), but in practice they demand significant time and effort. This technical debt for this particular example can fall under a category called LLMOps (or more broadly GenAIOps or AIOps), which, although based on similar principles to the old friend MLOps, includes the unique challenges of Large Language Models.
It’s a field that blends DevOps with governance, safety and responsibility, and yet… unless you’re seen as an ‘innovation’, no one pays much attention. Until it breaks. Then suddenly it becomes important (especially when regulators come knocking with RAI fines).
As promised, after my long complaint 😓, let me walk you through the actual steps behind the diagram.
It all starts, of course, with data (yeah… you still need data scientists or data engineers). Not clean, beautiful, labeled data… no. We’re talking raw, unstructured and unlabeled data, “things” like insurance dictionaries, multi-page policy contracts, or even transcripts from contact center conversations. And because we want to build something useful, we also need a truth or gold standard (benchmark you trust for evaluation), or at least something like that… If you’re an ethical professional who wants to build real value, you’ll need to find the truth in the simplest way, but always the right way.
Once you’ve got the raw data, the next step is to process it into something the LLM can actually digest. That means cleaning, chunking, standardizing, and if you’re working with transcripts, removing all the filler and noise (step 4 in Figure 1).
Now, we use a prompt and a base LLM (in this case LLaMA) to automatically generate question-answer pairs from the processed documents and transcriptions (step 5, that uses steps 1-2). That forms the supervised dataset for fine-tuning (step 6). These data should contain the pair question-answer, the categorization of the question and the source (name and page), this last for validation. The prompt instructs the model to explicitly state when sources are contradictory or when the required information is missing. For the categorization, we assign each question a category using zero-shot classification over a fixed taxonomy; when higher accuracy is needed, we switch to few-shot classification by adding a few labeled examples to the prompt.
LLM-assisted labelling accelerates setup but has drawbacks (hallucinations, shallow coverage, style drift), so it is important to pair it with automatic checks and targeted human review before training.
Additionally, we create a ground-truth (step 3) set: domain-expert–authored question–answer pairs with sources, used as a benchmark to evaluate the solution. This sample has fewer rows than the fine-tuning dataset but gives us a clear idea of what to expect. We can also expand it during pilot trials with a small group of users before production.
To personalize the user’s response (LLMs lacks of specialized domain knowledge) we decided to fine-tune a open-source model called Mixtral using LoRA (step 6). The idea was to make it more “insurance-friendly” and able to respond in a tone and language closer to how real insurance people communicate, we evaluate the results with steps 3 and 7. Of course, we also wanted to complement that with long-term memory, which is where AWS Titan embeddings and vector search come into play (step 8). This is the RAG architecture, combining semantic retrieval with context-aware generation.
From there, the flow is simple:
The user asks a question (step 13), the system retrieves top relevant chunks (step 9 & 10) from the knowledge base using vector search + metadata filters (to make it more scalable to different insurance branches and types of clients), and the LLM (fine-tuned multilingual Mixtral) generates a well-grounded response using a carefully engineered prompt (step 11).
These elements summarise the diagram, but behind this there are challenges and details that, if not taken care of, can lead to reproducing unwanted behaviour; for this reason, there are elements that are necessary to incorporate in order not to lose control of the application.
Well … Let’s begin with the article 😄…
In production, things change:
- Users ask unexpected questions.
- Context retrieval fails silently and the model answers with false confidence.
- Prompts degrade in quality over time, that’s called prompt drift.
- Business logic shifts (over time, policies evolve: new exceptions, amended terms, new clauses, riders/endorsements, and entirely new contract versions driven by regulation, market shifts, and risk changes.)
- Fine-tuned models behave inconsistently.
This is the part most people forget: the lifecycle doesn’t end at deployment, it starts there.
What does “Ops” cover?
I created this diagram (Figure 2) to visualize how all the pieces fit together. The steps, the logic, the feedback loops, at least how I lived them in my experience. There are certainly other ways to represent this, but this is the one I find most complete.

We assume this diagram runs on a secure stack with controls that protect data and prevent unauthorized access. This doesn’t remove our responsibility to verify and validate security throughout development; for that reason, I include a developer-level safeguard box, which I’ll explain in more detail later.
We intentionally follow a linear gate: Data Management → Model Development → Evaluation & Monitoring → Deploy (CI/CD). Only models that pass offline checks are deployed; once in production online monitoring then feeds back into data and model refinement (loop). After the first deployment, we use online monitoring to continuously refine and improve the solution.
Just in case, we briefly describe each step:
- Model Development: here you define the “ideal” or “less wrong” model architecture aligned with business necessities. Gather initial datasets, maybe fine-tune a model (or just prompt-engineer or RAG or altogether). The goal? Get something working — a prototype/MVP that proves feasibility. After the first production release, keep refining via retraining and, when appropriate, incorporate advanced techniques (e.g., RL/RLHF) to improve performance.
- Data Management: Handling versions for data and prompts; keep metadata related with versioning, schemas, sources, operational signals (as token usage, latency, logs), etc. Manage and govern raw and processed data in all their forms: printed or handwritten, structured and unstructured, including texts, audio, videos, images, relational, vectorial and graphs databases or any other type that can be used by the system. Even extract information from unstructured formats and metadata; maintain a graph store that RAG can query to power analytical use cases. And please don’t make me talk about “quality,” which is often poorly handled, introduces noise into the models, and ultimately makes the work harder.
- Model Deployment (CI/CD): package the model and its dependencies into a reproducible artifact for promotion across environments; expose the artifact for inference (REST/gRPC or batch); and run testing pipelines that automatically check every change and block deployment if thresholds fail (unit tests, data/schema checks, linters, offline evals on golden sets, performance/security scans, canary/blue-green with rollback).
- Monitoring & Observability: Tracking model performance, drift, usage, errors in production.
- Safeguards: Defend against prompt injection, enforce access controls, protect data privacy, and evaluate for bias and toxicity.
- Cost Management: Monitoring and controlling usage and costs; budgets, per-team quotas, tokens, etc.
- Business value: Develop a business case to analyse whether projected outcomes actually makes sense compared to what is actually delivery. The value of this kind of solution is not seen immediately, but rather over time. There are a series of business considerations that generate costs and should help determine whether the application still makes sense or not. This step is not an easy one (especially for embedded applications), but at the very least, it requires discussion and debate. It is an exercise that is required to be done.
So, to transform our prototype into a production-grade, maintainable application, several critical layers must be addressed. These aren’t extras; they’re the essential steps to ensure every detail is properly managed. In what follows, I’ll focus on observability (evaluation and monitoring) and safeguards, since the broader topic could fill a book.
Evaluation & Monitoring
Observability is about continuously monitoring the system over time to ensure it keeps performing as expected. It involves tracking the key metrics to detect gradual degradation, drift, or other deviations across inputs, outputs, and intermediate steps (retrieval results, prompt, API calls, among others), and capturing them in a form that supports analysis and subsequent refinement.

With this in place, you can automate alerts that trigger when defined thresholds are crossed e.g., a sudden drop in answer relevance, a rise in retrieval latency, or unexpected spikes in token usage.
To ensure that the application behaves correctly at different stages, it is highly useful to create a truth or golden dataset curated by domain experts. This dataset serves as a benchmark for validating responses during training, fine-tuning, and evaluation (step 3, figure 1).
Evaluate fine-tuning:
We begin by measuring hallucination and answer relevance. We then compare these metrics between a baseline Mixtral model (without fine-tuning) and our Mixtral model fine-tuned for insurance-specific language (step 6, Figure 1).
The comparison between the baseline and the fine-tuned model serves two purposes: (1) it shows whether the fine-tuned model is better adapted to the Q&A dataset than the untuned baseline, and (2) it allows us to set a threshold to detect performance degradation over time, both relative to prior versions and to the baseline.
With this in mind we tried Claude 3 (via AWS Bedrock) to score each model response against a domain-expert gold answer. The highest score means “equivalent to or very close to the gold truth,” and the lowest means “irrelevant or contradictory.”
Claude claim-level evaluator. We decompose each model answer into atomic claims. Given the gold evidence, Claude labels each claim as entailed / contradicted / not_in_source and returns JSON. If the context lacks the information to answer, a correct response like entailed (we prefer no answer than wrong answer). For each answer we compute Claim Support (CS) = #entailed / total_claims and Hallucination rate = 1 − CS, then report dataset scores by averaging CS (and HR) across all answers. This directly measures how much of the answer is confirmed by the domain expert answer and aligns with claim-level metrics found in the literature [3].
This claim-level evaluator offers greater granularity and effectiveness, especially when an answer contains a mix of correct and incorrect statements. Our previous scoring method assigned a single grade to overall performance, which obscured specific errors that needed to be addressed.
The idea is to extend this metric to verify answers against the documentary sources and, additionally, maintain a second benchmark that is easier to build and update than a domain-expert set (and less prone to error). Achieving this requires further refinement.
Additionally, to assess answer relevance, we compute cosine similarity between embeddings of the model’s answer and the gold answer. The drawback is that embeddings can look “similar” even when the facts are wrong. As an alternative, we use an LLM-as-judge (Claude) to label relevance as direct, partial, or irrelevant, (taking in account the question) similar to the approach above.
These evaluations and ongoing monitoring can detect issues such as a question–answer dataset lacking context, sources, sufficient examples, or proper question categorization. If the fine-tuning prompt differs from the inference prompt, the model may tend to ignore sources and hallucinate in production because it never learned to ground its outputs in the provided context. Whenever any of these variables change, the monitoring system should trigger an alert and provide diagnostics to facilitate investigation and remediation.
Moderation:
To measure moderation or toxicity, we used the DangerousQA benchmark (200 adversarial questions) [4] and had Claude 3 evaluate each response with an adapted prompt, scoring 1 (highly negative) to 5 (neutral) across Toxicity, Racism, Sexism, Illegality, and Harmful Content. Both the base and fine-tuned Mixtral models consistently scored 4–5 in all categories, indicating no toxic, illegal, or disrespectful content.
Public benchmarks, such as DangerousQA often leak into LLM training data, which means that new models memorize this test items. This train-test data overlap leads to inflated scores and can obscure real risks. To mitigate it, alternatives like develop private benchmarks, rotate evaluation sets, or generate fresh benchmark variants are necessary to ensuring that test contamination does not artificially inflate model performance.
Evaluate RAG:
Here, we focus exclusively on the quality of the retrieved context. During preprocessing (step 4, Figure 1), we divide the documents into chunks, aiming to encapsulate coherent fragments of information. The objective is to ensure that the retrieval layer ranks the most useful information at the top before it reaches the generation model.
We compared two retrieval setups: (A) without reranking : return the top-k passages using keyword or dense embeddings only; and (B) with reranking: retrieve candidates via embeddings, then reorder the top-k with a reranker (pretrained ms-marco-mini-L-12-v2 model in LangChain). For each question in a curated set with expert gold truth, we labeled the retrieved context as Complete, Partial, or Irrelevant, then summarized coverage (percent Complete/Partial/Irrelevant) and win rates between setups.
Re-ranking consistently improved the context quality of results, but the gains were highly sensitive to chunking/segmentation: fragmented or incoherent chunks (e.g., clipped tables, duplicates) degraded final answers even when the relevant pieces were technically retrieved.
Finally, during production, user feedback and answer ratings from users is collected to enrich this ground truth over time. Frequently asked questions (FAQs) and their verified responses are also cached to reduce inference costs and provide fast, reliable answers with high confidence.
Rubrics as an alternative:
The quick evaluation approach used to assess the RAG and fine-tuned model provides an initial general overview of model responses. However, an alternative under consideration is a multi-step evaluation using domain-specific grading rubrics. Instead of assigning a single overall grade, rubrics break down the ideal answer into a binary checklist of clear, verifiable criteria. Each criterion is marked as yes/no or true/false and supported by evidence or sources, enabling a precise diagnosis of where the model excels or falls short [15]. This systematic rubric approach offers a more detailed and actionable assessment of model performance but requires time for development, so it remains part of our technical debt roadmap.
Safeguards
There is often pressure to deliver a minimum viable product as quickly as possible, which means that checking for potential vulnerabilities in datasets, prompts, and other development components is not always a top priority. However, extensive literature highlights the importance of simulating and evaluating vulnerabilities, such as testing adversarial attacks by introducing inputs that the application or system did not encounter during training/development. To effectively implement these security assessments, it is crucial to foster awareness that vulnerability testing is an essential part of both the development process and the overall security of the application.
In Table 2, we outline several attack types with example impacts. For instance, GitLab recently faced a remote prompt injection that affected the AI Duo code assistant, resulting in source code theft. In this incident, attackers embedded hidden prompts in public repositories, causing the assistant to leak sensitive information from private repositories to external servers. This real-world case highlights how such vulnerabilities can lead to breaches, underscoring the importance of anticipating and mitigating prompt injection and other emerging AI-driven threats in application security.

Additionally, we must be aware of biased outputs in AI results. A 2023 Washington Post article titled “This is how AI image generators see the world” demonstrates, through images, how AI models reproduce and even amplify the biases present in their training data. Ensuring fairness and mitigating bias is an important task that often gets overlooked due to time constraints, yet it remains crucial for building trustworthy and equitable AI systems.
Conclusion
Although the main idea of the article was to illustrate the complexities of LLM-based applications through the example of a typical (but synthetic) use case, the reason for emphasizing the need for a robust and scalable system is clear: building such applications is far from simple. It is essential to remain vigilant about potential issues that may arise if we fail to continuously monitor the system, ensure fairness, and address risks proactively. Without this discipline, even a promising application can quickly become unreliable, biased, or misaligned with its intended purpose.
References
[1] South Park Commons. (2025, March 7). CEO of Microsoft on AI Agents & Quantum | Satya Nadella [Video]. YouTube. https://www.youtube.com/watch?v=ZUPJ1ZnIZvE — see 31:05.
[2] Potluri, S. (2025, June 23). The AI Stack Is Evolving: Meet the Models Behind the Scenes. Medium — Women in Technology. Medium
[3] Košprdić, M., et. al. (2024). Verif. ai: Towards an open-source scientific generative question-answering system with referenced and verifiable answers. arXiv preprint arXiv:2402.18589. https://arxiv.org/abs/2402.18589.
[4] Bhardwaj, R., et al. “Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment” arXiv preprint arXiv:2308.09662 (2023). GitHub repository: https://github.com/declare-lab/red-instruct
Paper link: https://arxiv.org/abs/2308.09662
[5] Yair, Or, Ben Nassi, and Stav Cohen. “Invitation Is All You Need: Invoking Gemini for Workspace Agents with a Simple Google Calendar Invite.” SafeBreach Blog, 6 Aug. 2025. https://www.safebreach.com/blog/invitation-is-all-you-need-hacking-gemini/
[6] Burgess, M., & Newman, L. H. (2025, January 31). DeepSeek’s Safety Guardrails Failed Every Test Researchers Threw at Its AI Chatbot. WIRED. https://www.wired.com/story/deepseeks-ai-jailbreak-prompt-injection-attacks/?utm_source=chatgpt.com
[7] Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., … & Song, D. (2018). Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1625–1634).
[8] Burgess, M. (2025, August 6). A Single Poisoned Document Could Leak ‘Secret’ Data Via ChatGPT. WIRED. https://www.wired.com/story/poisoned-document-could-leak-secret-data-chatgpt/?utm_source=chatgpt.com
[9] Epelboim, M. (2025, April 7). Why Your AI Model Might Be Leaking Sensitive Data (and How to Stop It). NeuralTrust. NeuralTrust.
[10] Zhou, Z., Zhu, J., Yu, F., Li, X., Peng, X., Liu, T., & Han, B. (2024). Model inversion attacks: A survey of approaches and countermeasures. arXiv preprint arXiv:2411.10023. https://arxiv.org/abs/2411.10023
[11] Li, Y., Jiang, Y., Li, Z., & Xia, S. T. (2022). Backdoor learning: A survey. IEEE transactions on neural networks and learning systems, 35(1), 5–22.
[12] Daneshvar, S. S., Nong, Y., Yang, X., Wang, S., & Cai, H. (2025). VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs. ACM Transactions on Software Engineering and Methodology.
[13] Standaert, F. X. (2009). Introduction to side-channel attacks. In Secure integrated circuits and systems (pp. 27–42). Boston, MA: Springer US.
[14] Tiku N., Schaul K. and Chen S. (2023, November 01). This is how AI image generators see the world. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-generated-images-bias-racism-sexism-stereotypes/ (last accessed Aug 20, 2025).
[15] Cook J., Rocktäschel T., Foerster J, Aumiller D., Wang A. (2024). TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation. arXiv preprint arXiv:2410.03608. https://arxiv.org/abs/2410.03608