Validation and evaluation are critical to ensuring robust, high-performing LLM applications. However, these topics are often overlooked in the greater scheme of working with LLMs.
Imagine this scenario: You have an LLM query that replies correctly 999/1000 times when prompted. However, you have to run backfilling on 1.5 million items to populate the database. In this (very realistic) scenario, you’ll experience 1500 errors for this LLM prompt alone. Now scale this up to 10s, if not 100s of different prompts, and you’ve got a real scalability issue at hand.
The solution is to validate your LLM output and ensure high performance using evaluations, both of which I'll discuss in this article.
What is LLM validation and evaluation?
I think it’s essential to start by defining what LLM validation and evaluation are, and why they’re important for your application.
LLM validation is about validating the quality of your outputs. One common example of this is running some piece of code that checks if the LLM response answered the user's question. Validation is important because it ensures you're providing high-quality responses and your LLM is performing as expected. Validation can be seen as something you do in real time, on individual responses. For example, before returning the response to the user, you verify that the response is actually of high quality.
LLM evaluation is similar; however, it usually does not occur in real time. Evaluating your LLM output could, for example, involve looking at all the user queries from the last 30 days and quantitatively assessing how well your LLM performed.
Validating and evaluating your LLM’s performance is important because you will experience issues with the LLM output. It could, for example, be
- Issues with input data (missing data)
- An edge case your prompt is not equipped to handle
- Out-of-distribution data
- Etc.
Thus, you need a robust solution for handling LLM output issues. You need to ensure you avoid them as often as possible and handle them in the remaining cases.
Murphy’s law adapted to this scenario:
On a large scale, everything that can go wrong, will go wrong
Qualitative vs quantitative assessments
Before moving on to the individual sections on performing validation and evaluations, I also want to comment on qualitative vs quantitative assessments of LLMs. When working with LLMs, it's often tempting to manually evaluate the LLM's performance for different prompts. However, such manual (qualitative) assessments are highly subject to biases. For example, you might focus most of your attention on the cases in which the LLM succeeded, and thus overestimate its performance. Keeping these potential biases in mind when working with LLMs is important to mitigate the risk of them influencing your ability to improve the model.
Large-scale LLM output validation
After running millions of LLM calls, I've seen a lot of different outputs, such as GPT-4o returning … or Qwen2.5 responding with unexpected Chinese characters in its responses.
These errors are incredibly difficult to detect with manual inspection because they usually happen in less than 1 out of 1000 API calls to the LLM. However, you need a mechanism to catch these issues when they occur in real time, on a large scale. Thus, I’ll discuss some approaches to handling these issues.
Simple if-else statement
The simplest solution for validation is a piece of code with an if statement that checks the LLM output. For example, if you want to generate summaries for documents, you might want to ensure the LLM output is at least above some minimal length:
# LLM summary validation
# first generate the summary through an LLM client such as OpenAI, Anthropic, Mistral, etc.
summary = llm_client.chat(f"Make a summary of this document {document}")

# validate the summary: reject summaries shorter than a minimal length
def validate_summary(summary: str) -> bool:
    if len(summary) < 20:
        return False
    return True
Then you can run the validation.
- If the validation passes, you can continue as usual
- If it fails, you can choose to ignore the request or utilize a retry mechanism
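Putting these pieces together, a minimal sketch of the flow could look like this, reusing llm_client and validate_summary from the example above (MAX_RETRIES and the fallback behaviour are assumptions you would tune):

# Sketch of a validate-and-retry loop around the summary generation
MAX_RETRIES = 3

def generate_validated_summary(document: str) -> str | None:
    for _ in range(MAX_RETRIES):
        summary = llm_client.chat(f"Make a summary of this document {document}")
        if validate_summary(summary):
            return summary
    # every attempt failed: ignore the request (or log it for later inspection)
    return None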
You can, of course, make the validate_summary function more elaborate, for example:
- Utilizing regex for complex string matching
- Using a library such as Tiktoken to count the number of tokens in the request
- Ensure specific words are present/not present in the response
- etc.
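For illustration, an extended validate_summary combining these checks could look roughly like this; the length limits, regex, and forbidden phrases are placeholder values, and Tiktoken is only needed for the token count:

import re
import tiktoken

def validate_summary(summary: str) -> bool:
    # reject summaries that are too short (placeholder limit)
    if len(summary) < 20:
        return False
    # token count via Tiktoken; pick the encoding that matches your model
    encoding = tiktoken.get_encoding("cl100k_base")
    if len(encoding.encode(summary)) > 500:
        return False
    # regex example: no Markdown headers if you expect plain text
    if re.search(r"^#{1,6}\s", summary, flags=re.MULTILINE):
        return False
    # ensure specific refusal phrases are not present
    forbidden_phrases = ["as an ai language model", "i cannot help"]
    if any(phrase in summary.lower() for phrase in forbidden_phrases):
        return False
    return True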
LLM as a validator

A more advanced and costly validator is using an LLM. In these cases, you utilize another LLM to assess if the output is valid. This works because validating correctness is usually a more straightforward task than generating a correct response. Using an LLM validator is essentially utilizing LLM as a judge, a topic I have written another Towards Data Science article about here.
I often utilize smaller LLMs to perform this validation task because they have faster response times, cost less, and still work well, considering that the task of validating is simpler than generating a correct response. For example, if I utilize GPT-4.1 to generate a summary, I would consider GPT-4.1-mini or GPT-4.1-nano to assess the validity of the generated summary.
Again, if the validation succeeds, you continue your application flow, and if it fails, you can ignore the request or choose to retry it.
In the case of validating the summary, I would prompt the validating LLM to look for summaries that:
- Are too short
- Don’t adhere to the expected answer format (for example, Markdown)
- And other rules you may have for the generated summaries
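To make this concrete, here is a rough sketch of such an LLM validator; small_llm_client, the prompt wording, and the VALID/INVALID convention are assumptions you would adapt to your own setup:

# Sketch of using a smaller LLM to validate a generated summary
# small_llm_client.chat is a placeholder for your client (OpenAI, Anthropic, Mistral, etc.)
VALIDATION_PROMPT = """You are validating a document summary.
Answer with only VALID or INVALID.
Mark the summary INVALID if it is too short, does not follow the expected
Markdown format, or violates any of these rules: {rules}

Document:
{document}

Summary:
{summary}"""

def validate_summary_with_llm(document: str, summary: str, rules: str) -> bool:
    response = small_llm_client.chat(
        VALIDATION_PROMPT.format(rules=rules, document=document, summary=summary)
    )
    return response.strip().upper().startswith("VALID")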
Quantitative LLM evaluations
It is also super important to perform large-scale evaluations of LLM outputs. I recommend running these either continuously or at regular intervals. Quantitative LLM evaluations are also more effective when combined with qualitative assessments of data samples. For example, suppose the evaluation metrics highlight that your generated summaries are longer than what users prefer. In that case, you should manually look into those generated summaries and the documents they are based on. This helps you understand the underlying problem, which in turn makes solving it easier.
LLM as a judge
Same as with validation, you can utilize LLM as a judge for evaluation. The difference is that while validation uses LLM as a judge for binary predictions (either the output is valid, or it's not), evaluation uses it for more detailed feedback. You can, for example, receive feedback from the LLM judge on the quality of a summary from 1-10, making it easier to distinguish medium-quality summaries (around 4-6) from high-quality summaries (7+).
Again, you have to consider costs when using LLM as a judge. Even though you may be utilizing smaller models, you are essentially doubling the number of LLM calls when using LLM as a judge. You can thus consider the following changes to save on costs:
- Sampling data points, so you only run LLM as a judge on a subset of data points
- Grouping several data points into one LLM as a judge prompt, to save on input and output tokens
I recommend detailing the judging criteria to the LLM judge. For example, you should state what constitutes a score of 1, a score of 5, and a score of 10. Using examples is often a great way of instructing LLMs, as discussed in my article on utilizing LLM as a judge. I often think about how helpful examples are for me when someone is explaining a topic, and you can thus imagine how helpful they are for an LLM.
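As an illustrative sketch, a judge prompt with explicit score anchors could look like the following; the rubric wording and judge_llm_client are assumptions made for the example:

# Sketch of an LLM-as-a-judge prompt with an explicit 1-10 rubric
JUDGE_PROMPT = """Rate the quality of the summary below on a scale from 1 to 10.
Scoring guide:
- 1: the summary is unrelated to the document or unreadable
- 5: the summary covers some key points but misses important information or is too long
- 10: the summary is concise, accurate, and covers all key points of the document

Reply with only the integer score.

Document:
{document}

Summary:
{summary}"""

def judge_summary(document: str, summary: str) -> int:
    # judge_llm_client.chat is a placeholder for a smaller, cheaper judge model
    response = judge_llm_client.chat(JUDGE_PROMPT.format(document=document, summary=summary))
    return int(response.strip())

You can combine this with the cost-saving ideas above, for example by only judging a random subset of summaries.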
User feedback
User feedback is a great way of receiving quantitative metrics on your LLM’s outputs. User feedback can, for example, be a thumbs-up or thumbs-down button, stating if the generated summary is satisfactory. If you combine such feedback from hundreds or thousands of users, you have a reliable feedback mechanism you can utilize to vastly improve the performance of your LLM summary generator!
These users can be your customers, so you should make it easy for them to provide feedback and encourage them to provide as much of it as possible. More generally, these users can be anyone who doesn't utilize or develop your application on a day-to-day basis. It's important to remember that any such feedback will be incredibly valuable for improving the performance of your LLM, and it doesn't really cost you (as the developer of the application) any time to gather.
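As a trivial sketch of turning thumbs-up/thumbs-down feedback into a quantitative metric (the shape of the feedback records is an assumption):

# Each feedback record is assumed to look like {"summary_id": "...", "thumbs_up": True}
def thumbs_up_rate(feedback_records: list[dict]) -> float:
    if not feedback_records:
        return 0.0
    positive = sum(1 for record in feedback_records if record["thumbs_up"])
    return positive / len(feedback_records)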
Conclusion
In this article, I have discussed how you can perform large-scale validation and evaluation in your LLM application. Doing this is incredibly important to both ensure your application performs as expected and to improve your application based on user feedback. I recommend incorporating such validation and evaluation flows in your application as soon as possible, given the importance of ensuring that inherently unpredictable LLMs can reliably provide value in your application.
You can also read my articles on How to Benchmark LLMs with ARC AGI 3 and How to Effortlessly Extract Receipt Information with OCR and GPT-4o mini
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium