Large language models (LLMs) have entered the world of computer science at a record pace. LLMs are powerful models capable of effectively performing a wide variety of tasks. However, LLM outputs are stochastic, making them unreliable. In this article, I discuss how you can ensure reliability in your LLM applications by properly prompting the model and handling the output.
You can also read my articles on Attending NVIDIA GTC Paris 2025 and Creating Powerful Embeddings for Machine Learning.
Motivation
My motivation for this article is that I am consistently developing new applications using LLMs. LLMs are generalized tools that can be applied to most text-dependent tasks such as classification, summarization, information extraction, and much more. Furthermore, the rise of vision language models also enables us to handle images similarly to how we handle text.
I often encounter the problem that my LLM applications are inconsistent. Sometimes the LLM doesn’t respond in the desired format, or I am unable to properly parse the LLM response. This is a huge problem when you are working in a production setting and are fully dependent on consistency in your application. I will thus discuss the techniques I use to ensure reliability for my applications in a production setting.
Ensuring output consistency
Markup tags
To ensure output consistency, I use a technique where my LLM answers in markup tags. I use a system prompt like:
prompt = f"""
Classify the text into "Cat" or "Dog"
Provide your response in tags
"""
And the model will almost always respond with:
<response>Cat</response>
or
<response>Dog</response>
You can now easily parse out the response using the following code:
def _parse_response(response: str) -> str:
    return response.split("<response>")[1].split("</response>")[0]
The reason using markup tags works so well is that this is how the model is trained to behave. When OpenAI, Qwen, Google, and others train these models, they use markup tags. The models are thus super effective at utilizing these tags and will, in almost all cases, adhere to the expected response format.
For example, with reasoning models, which have been on the rise lately, the models first do their thinking enclosed in <think> tags before giving their final answer.
Furthermore, I also try to use as many markup tags as possible elsewhere in my prompts. For example, if I am providing few-shot examples to my model, I will do something like:
prompt = f"""
Classify the text into "Cat" or "Dog"
Provide your response in tags
This is an image showing a cat -> Cat
This is an image showing a dog -> Dog
"""
I do two things that help the model perform here:
- I provide the examples in <examples> tags.
- In my examples, I make sure to adhere to my own expected response format, using the <response> tags.
Using markup tags, you can thus ensure a high level of output consistency from your LLM.
Output validation
Pydantic is a tool you can use to validate the output of your LLMs. You can define types and validate that the model's output adheres to the type you expect. For example, you can follow the example below, based on this article:
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


class Profile(BaseModel):
    name: str
    email: str
    phone: str


resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Return the `name`, `email`, and `phone` of user {user} in a json object."
        },
    ]
)

Profile.model_validate_json(resp.choices[0].message.content)
As you can see, we prompt GPT to respond with a JSON object, and we then run Pydantic to ensure the response is as we expect.
I would also like to note that sometimes it’s easier to simply create your own output validation function. In the last example, the only real requirements are that the response object contains the keys name, email, and phone, and that all of them are strings. You can validate this in Python with a function:
import json

def validate_output(output: str):
    data = json.loads(output)
    assert "name" in data and isinstance(data["name"], str)
    assert "email" in data and isinstance(data["email"], str)
    assert "phone" in data and isinstance(data["phone"], str)
With this, you do not have to install any packages, and in a lot of cases, it is easier to set up.
Tweaking the system prompt
You can also make several other tweaks to your system prompt to ensure a more reliable output. I always recommend making your prompt as structured as possible, using:
- Markup tags as mentioned earlier
- Lists, such as this one
In general, you should also always make your instructions clear. You can use the following rule of thumb to check the quality of your prompt:
If you gave the prompt to another human who had never seen the task before and had no prior knowledge of it, would they be able to perform the task effectively?
If you cannot have a human do the task, you usually cannot expect an AI to do it (at least for now).
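To make this concrete, here is a rough sketch of how the cat/dog classification prompt could be structured; the tag names and sections are just one reasonable layout, not a fixed convention:
system_prompt = """
<task>
Classify the text into "Cat" or "Dog".
</task>

<instructions>
- Respond with exactly one word: "Cat" or "Dog".
- Provide your response in <response> tags.
</instructions>
"""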
Handling errors
Errors are inevitable when dealing with LLMs. If you make enough API calls, it is almost certain that at some point the response will not be in your desired format, or that some other issue will occur.
In these scenarios, it’s important that you have a robust application equipped to handle such errors. I use the following techniques to handle errors:
- Retry mechanism
- Increase the temperature
- Have backup LLMs
Now, let me elaborate on each point.
Exponential backoff retry mechanism
It’s important to have a retry mechanism in place, considering a lot of issues can occur when making an API call. You might encounter issues such as rate limiting, incorrect output format, or a slow response. In these scenarios, you should wrap the LLM call in a try/except block and retry. Usually, it’s also smart to use an exponential backoff, especially for rate-limiting errors, so that you wait long enough between attempts to avoid triggering further rate limiting.
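As a minimal sketch, such a retry wrapper could look like the following. Here, call_llm is a placeholder for whatever function performs your API call and parses the response; in practice you would catch your provider’s specific rate-limit and parsing exceptions rather than a bare Exception:
import random
import time

def call_with_backoff(call_llm, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an LLM call, doubling the wait time after each failure."""
    for attempt in range(max_retries):
        try:
            return call_llm()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # wait roughly 1s, 2s, 4s, ... plus a little jitter before retrying
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
If you’d rather not write the loop yourself, libraries such as tenacity offer the same behavior as a decorator.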
Temperature increase
I also sometimes recommend increasing the temperature a bit. If you set the temperature to 0, you tell the model to act deterministically. However, sometimes this can have a negative effect.
For example, suppose the model failed to respond in the proper output format for a given input. If you retry with a temperature of 0, you are likely to run into the exact same issue. I thus recommend setting the temperature slightly higher, for example 0.1, to introduce some stochasticity into the model while keeping its outputs relatively deterministic.
This is the same logic that a lot of agents use: a higher temperature helps them avoid getting stuck in a loop and repeating the same errors.
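For illustration, here is a rough sketch of a retry loop that bumps the temperature on each attempt. It reuses the _parse_response function from earlier, and the model name and temperature schedule are simply example choices:
def classify_with_retries(client, prompt: str, max_retries: int = 3) -> str:
    """Retry on malformed output, raising the temperature slightly each attempt."""
    temperatures = [0.0, 0.1, 0.3]  # start deterministic, add a bit of randomness on retries
    for attempt in range(max_retries):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperatures[min(attempt, len(temperatures) - 1)],
        )
        try:
            return _parse_response(resp.choices[0].message.content)
        except IndexError:
            continue  # the <response> tags were missing, so retry with more randomness
    raise ValueError("The model never returned a parsable response")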
Backup LLMs
Another powerful method to deal with errors is to have backup LLMs. I recommend using a chain of LLM providers for all your API calls. For example, you first try OpenAI; if that fails, you use Gemini; and if that fails, you fall back to Claude (I sketch such a chain below).
This ensures reliability in the event of provider-specific issues. These could be issues such as:
- The server is down (for example, if OpenAI’s API is not available for a period of time)
- Filtering (sometimes, an LLM provider will refuse to answer your request if it believes the request violates its content moderation or jailbreak policies)
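A minimal sketch of such a fallback chain is shown below. It assumes each provider is reachable through an OpenAI-compatible client, which many providers offer; where that is not the case, each entry would simply wrap that provider’s own SDK call:
def call_with_fallback(prompt: str, providers: list) -> str:
    """Try each (client, model) pair in order until one of them succeeds."""
    last_error = None
    for client, model in providers:
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as exc:
            last_error = exc  # provider down, rate limited, or request refused
    raise RuntimeError("All providers failed") from last_error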
In general, it is simply good practice not to be fully dependent on one provider.
Conclusion
In this article, I have discussed how you can ensure reliability in your LLM application. LLM applications are inherently stochastic because you cannot directly control the output of an LLM. It is thus important to ensure you have proper policies in place, both to minimize the errors that occur and to handle the errors when they occur.
I have discussed the following approaches to minimize errors and handle errors:
- Markup tags
- Output validation
- Tweaking the system prompt
- Retry mechanism
- Increase the temperature
- Have backup LLMs
If you incorporate these techniques into your application, you can build an LLM application that is both powerful and robust.