
How to Ensure Reliability in LLM Applications

Large language models (LLMs) have entered the world of computer science at a record pace. LLMs are powerful models capable of effectively performing a wide variety of tasks. However, LLM outputs are stochastic, which makes them unreliable. In this article, I discuss how you can ensure reliability in your LLM applications by properly prompting the model and handling the output.

This infographic highlights the contents of this article. I will mainly discuss ensuring output consistency and handling errors. Image by ChatGPT.

You can also read my articles on Attending NVIDIA GTC Paris 2025 and Creating Powerful Embeddings for Machine Learning.

Table of Contents

  • Motivation
  • Ensuring output consistency
  • Handling errors
  • Conclusion

Motivation

My motivation for this article is that I am consistently developing new applications using LLMs. LLMs are generalized tools that can be applied to most text-dependent tasks, such as classification, summarization, and information extraction. Furthermore, the rise of vision language models also enables us to handle images in much the same way we handle text.

I often encounter the problem that my LLM applications are inconsistent. Sometimes the LLM doesn’t respond in the desired format, or I am unable to properly parse the LLM response. This is a huge problem when you are working in a production setting and are fully dependent on consistency in your application. I will thus discuss the techniques I use to ensure reliability for my applications in a production setting.

Ensuring output consistency

Markup tags

To ensure output consistency, I use a technique where my LLM answers in markup tags. I use a system prompt like:

prompt = f"""
Classify the text into "Cat" or "Dog"

Provide your response in <answer> </answer> tags

"""

And the model will almost always respond with:

<answer>Cat</answer>

or 

<answer>Dog</answer>

You can now easily parse out the response using the following code:

def _parse_response(response: str) -> str:
    # Extract the text between the <answer> and </answer> tags
    return response.split("<answer>")[1].split("</answer>")[0]
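
Putting the prompt and the parser together, a full call might look like the sketch below, using the OpenAI client (which also appears later in this article). The model name and the example input are just placeholders.

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": "A small animal was chasing a laser pointer"},  # example input
    ],
)

label = _parse_response(resp.choices[0].message.content)  # e.g. "Cat"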

The reason using markup tags works so well is that this is how the model is trained to behave. When OpenAI, Qwen, Google, and others train these models, they use markup tags. The models are thus super effective at utilizing these tags and will, in almost all cases, adhere to the expected response format.

For example, with reasoning models, which have been on the rise lately, the model first does its thinking enclosed in <think> tags, and then provides its answer to the user.


Furthermore, I also try to use markup tags as much as possible elsewhere in my prompts. For example, if I am providing few-shot examples to my model, I will do something like:

prompt = f"""
Classify the text into "Cat" or "Dog"

Provide your response in <answer> </answer> tags

<examples>
This is an image showing a cat -> <answer>Cat</answer>

This is an image showing a dog -> <answer>Dog</answer>
</examples>
"""

I do two things that help the model perform here:

  1. I provide the examples in <examples> tags.
  2. In my examples, I adhere to my own expected response format, using the <answer> </answer> tags.

Using markup tags, you can thus ensure a high level of output consistency from your LLM.

Output validation

Pydantic is a tool you can use to validate the output of your LLMs. You can define types and validate that the output of the model adheres to the types you expect. For example, you can follow the example below, based on this article:

from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


# The schema we expect the LLM response to follow
class Profile(BaseModel):
    name: str
    email: str
    phone: str


resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Return the `name`, `email`, and `phone` of user {user} in a json object."
        },
    ]
)

# Raises a ValidationError if the response does not match the Profile schema
Profile.model_validate_json(resp.choices[0].message.content)

As you can see, we prompt GPT to respond with a JSON object, and we then run Pydantic to ensure the response is as we expect.
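
If the response does not match the schema, model_validate_json raises a ValidationError, which you can catch and use as a trigger for a retry (more on retries later in this article). A minimal sketch:

from pydantic import ValidationError

try:
    profile = Profile.model_validate_json(resp.choices[0].message.content)
except ValidationError as e:
    # The response did not match the expected schema, for example a missing
    # key or a field of the wrong type. Log the error and retry the call.
    print(f"Invalid LLM response: {e}")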


I would also like to note that sometimes it’s easier to simply create your own output validation function. In the last example, the only requirements for the response object are essentially that the response object contains the keys name, email, and phone, and that all of those are of the string type. You can validate this in Python with a function:

import json

def validate_output(output: str):
    # Parse the raw LLM response and check the expected keys and types
    parsed = json.loads(output)
    assert "name" in parsed and isinstance(parsed["name"], str)
    assert "email" in parsed and isinstance(parsed["email"], str)
    assert "phone" in parsed and isinstance(parsed["phone"], str)

With this, you do not have to install any packages, and in a lot of cases, it is easier to set up.

Tweaking the system prompt

You can also make several other tweaks to your system prompt to ensure a more reliable output. I always recommend making your prompt as structured as possible, using:

  • Markup tags as mentioned earlier
  • Lists, such as this one

In general, you should also always ensure your instructions are clear. You can use the following litmus test to check the quality of your prompt:

If you gave the prompt to another human who had never seen the task before and had no prior knowledge of it, would they be able to perform the task effectively?

If you cannot have a human do the task, you usually cannot expect an AI to do it (at least for now).
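
To make this concrete, here is a sketch of a structured classification prompt that combines a numbered list of instructions, markup tags, and few-shot examples. The tag names and the input variable are just illustrative choices, not a fixed convention.

text = "This is an image showing a golden retriever"  # placeholder input

prompt = f"""
You are a classifier that labels text as "Cat" or "Dog".

<instructions>
1. Read the text inside the <text> tags.
2. Respond with exactly one label: "Cat" or "Dog".
3. Provide your response in <answer> </answer> tags.
</instructions>

<examples>
This is an image showing a cat -> <answer>Cat</answer>

This is an image showing a dog -> <answer>Dog</answer>
</examples>

<text>
{text}
</text>
"""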

Handling errors

Errors are inevitable when dealing with LLMs. If you perform enough API calls, it is almost certain that sometimes the response will not be in your desired format, or that some other issue will occur.

In these scenarios, it’s important that you have a robust application equipped to handle such errors. I use the following techniques to handle errors:

  • Retry mechanism
  • Increase the temperature
  • Have backup LLMs

Now, let me elaborate on each point.

Exponential backoff retry mechanism

It’s important to have a retry mechanism in place, considering a lot of issues can occur when making an API call. You might encounter issues such as rate limiting, an incorrect output format, or a slow response. In these scenarios, you should wrap the LLM call in a try-except block and retry. Usually, it’s also smart to use an exponential backoff, especially for rate-limiting errors, so that you wait long enough to avoid triggering further rate limits.
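
A minimal sketch of such a retry loop is shown below. It assumes a call_llm function that wraps your provider's API and a validate_output function like the one shown earlier; both are placeholders for your own code.

import random
import time


def call_with_retries(prompt: str, max_retries: int = 5) -> str:
    """Call the LLM and retry with exponential backoff if something goes wrong."""
    for attempt in range(max_retries):
        try:
            response = call_llm(prompt)   # your wrapper around the provider's API
            validate_output(response)     # raises if the format is wrong
            return response
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, 8s, ...
            wait = 2 ** attempt + random.random()
            print(f"Attempt {attempt + 1} failed ({e}), retrying in {wait:.1f}s")
            time.sleep(wait)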

Temperature increase

I also sometimes recommend increasing the temperature a bit. If you set the temperature to 0, you tell the model to act deterministically. However, sometimes this can have a negative effect.

For example, suppose you have an input where the model failed to respond in the proper output format. If you retry it with a temperature of 0, you are likely to run into the same issue again. I thus recommend setting the temperature a bit higher, for example 0.1, to introduce some stochasticity into the model while keeping its outputs relatively deterministic.

This is the same logic that a lot of agents use: a slightly higher temperature helps them avoid getting stuck in a loop and repeating the same errors.
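
One simple way to combine this with the retry mechanism above is to increase the temperature slightly on each retry. This sketch reuses the hypothetical call_llm and validate_output placeholders, with call_llm now accepting a temperature parameter:

def call_with_increasing_temperature(prompt: str) -> str:
    # Start deterministic, then add a bit of randomness on each retry
    temperatures = [0.0, 0.1, 0.3]
    for attempt, temperature in enumerate(temperatures):
        try:
            response = call_llm(prompt, temperature=temperature)
            validate_output(response)
            return response
        except Exception:
            if attempt == len(temperatures) - 1:
                raise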

Backup LLMs

Another powerful method to deal with errors is to have backup LLMs. I recommend using a chain of LLM providers for all your API calls. For example, you first try OpenAI; if that fails, you fall back to Gemini; and if that also fails, you use Claude.

This ensures reliability in the event of provider-specific issues. These could be issues such as:

  • The server is down (for example, if OpenAI’s API is not available for a period of time)
  • Filtering (sometimes, an LLM provider will refuse to answer your request if it believes the request violates its content moderation or jailbreak policies)

In general, it is simply good practice not to be fully dependent on one provider.
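
A simple way to implement this is to iterate over an ordered list of provider-specific call functions and return the first response that validates. The call_openai, call_gemini, and call_claude functions below are hypothetical wrappers around each provider's SDK:

def call_with_fallbacks(prompt: str) -> str:
    # Ordered list of provider wrappers, most preferred first
    providers = [call_openai, call_gemini, call_claude]
    errors = []
    for call in providers:
        try:
            response = call(prompt)
            validate_output(response)  # reuse the validation from earlier
            return response
        except Exception as e:
            errors.append(f"{call.__name__}: {e}")
    raise RuntimeError(f"All providers failed: {errors}")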

Conclusion

In this article, I have discussed how you can ensure reliability in your LLM application. LLM applications are inherently stochastic because you cannot directly control the output of an LLM. It is thus important to ensure you have proper policies in place, both to minimize the errors that occur and to handle the errors when they occur.

I have discussed the following approaches to minimize and handle errors:

  • Markup tags
  • Output validation
  • Tweaking the system prompt
  • Retry mechanism
  • Increase the temperature
  • Have backup LLMs

If you combine these techniques in your application, you can build an LLM application that is both powerful and robust.

👉 Follow me on socials:

🧑‍💻 Get in touch
🌐 Personal Blog
🔗 LinkedIn
🐦 X / Twitter
✍️ Medium
🧵 Threads
