Context engineering is the science of providing LLMs with the correct context to maximize performance. When you work with LLMs, you typically create a system prompt that asks the LLM to perform a certain task. However, when working with LLMs from a programmer’s perspective, there are more elements to consider. You have to determine what other data you can feed your LLM to improve its ability to perform the task you asked it to do.
In this article, I will discuss the science of context engineering and how you can apply context engineering techniques to improve your LLM’s performance.
You can also read my articles on Reliability for LLM Applications and Document QA using Multimodal LLMs.
Definition
Before I start, it’s important to define the term context engineering. Context engineering is essentially the science of deciding what to feed into your LLM. This can, for example, be:
- The system prompt, which tells the LLM how to act
- Document data fetched using RAG (vector search)
- Few-shot examples
- Tools
The closest previous description of this has been the term prompt engineering. However, prompt engineering is a less descriptive term, considering it implies only changing the system prompt you are feeding to the LLM. To get maximum performance out of your LLM, you have to consider all the context you are feeding into it, not only the system prompt.
Motivation
My initial motivation for this article came from reading this Tweet by Andrej Karpathy.
I really agreed with the point Andrej made in this tweet. Prompt engineering is definitely an important science when working with LLMs. However, prompt engineering doesn’t cover everything we input into LLMs. In addition to the system prompt you write, you also have to consider elements such as:
- Which data to insert into your prompt
- How to fetch that data
- How to provide only the relevant information to the LLM
- Etc.
I will discuss all of these points throughout this article.
API vs Console usage
One important difference to clarify is whether you are using the LLMs from an API (calling it with code), or via the console (for example, via the ChatGPT website or application). Context engineering is definitely important when working with LLMs through the console; however, my focus in this article will be on API usage. The reason for this is that when using an API, you have more options for dynamically changing the context you are feeding the LLM. For example, you can do RAG, where you first perform a vector search, and only feed the LLM the most important bits of information, rather than the entire database.
These dynamic changes are not available in the same way when interacting with LLMs through the console; thus, I will focus on using LLMs through an API.
Context engineering techniques
Zero-shot prompting
Zero-shot prompting is the baseline for context engineering. Doing a task zero-shot means the LLM performs a task without being given any examples of it; you are essentially only providing a task description as context. For example, you might provide an LLM with a long text and ask it to classify the text into class A or B, according to some definition of the classes. The context (prompt) you are feeding the LLM could look something like this:
You are an expert text classifier, and tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment
Classify the text: {text}
Depending on the task, this could work very well. LLMs are generalists and are able to perform most simple text-based tasks. Classifying a text into one of two classes will usually be a simple task, and zero-shot prompting will thus usually work quite well.
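Since the focus here is API usage, here is a minimal sketch of what this zero-shot setup could look like in code. I'm assuming the OpenAI Python SDK; the model name and prompt wording are just placeholders:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

SYSTEM_PROMPT = """You are an expert text classifier, and tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment"""

def classify_zero_shot(text: str) -> str:
    # The only context is the task description (system prompt) and the text itself
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify the text: {text}"},
        ],
    )
    return response.choices[0].message.content

print(classify_zero_shot("I loved this movie, it was fantastic!"))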
Few-shot prompting
The follow-up from zero-shot prompting is few-shot prompting. With few-shot prompting, you provide the LLM with a prompt similar to the one above, but you also provide it with examples of the task it will perform. This added context will help the LLM improve at performing the task. Following up on the prompt above, a few-shot prompt could look like:
You are an expert text classifier, and tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment
{text 1} -> Class A
{text 2} -> Class B
Classify the text: {text}
You can see I’ve provided the model with a few example texts and their corresponding labels (the curly braces are placeholders) before asking it to classify the new text.
Few-shot prompting works well because you are providing the model with examples of the task you are asking it to perform. This usually increases performance.
You can imagine this working for humans as well. If you ask a human to perform a task they have never done before, just by describing it, they might do decently (depending, of course, on the difficulty of the task). However, if you also provide them with examples, their performance will usually increase.
Overall, I find it useful to think about LLM prompts as if I’m asking a human to perform a task. Imagine instead of prompting an LLM, you simply provide the text to a human, and you ask yourself the question:
Given this prompt, and no other context, will the human be able to perform the task?
If the answer is no, you should work on clarifying and improving your prompt.
I also want to mention dynamic few-shot prompting, considering it’s a technique I’ve had a lot of success with. Traditionally, with few-shot prompting, you have a set list of examples you feed into every prompt. However, you can often achieve higher performance using dynamic few-shot prompting.
Dynamic few-shot prompting means selecting the few-shot examples dynamically when creating the prompt for a task. For example, suppose you are asked to classify a text into classes A and B, and you already have a list of 200 texts with their corresponding labels. You can then perform a similarity search between the new text you are classifying and the example texts you already have, measure the vector similarity, and choose only the most similar texts (out of the 200) to feed into your prompt as context. This way, you’re providing the model with more relevant examples of how to perform the task.
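Here is a rough sketch of how that selection step could look, assuming the sentence-transformers library for embeddings (the embedding model, example data, and function names are placeholders I've chosen for illustration):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Your existing labeled examples (e.g. the 200 texts and their labels)
example_texts = ["Great product, works perfectly", "Terrible service, very disappointed"]
example_labels = ["Class A", "Class B"]
example_embeddings = embedder.encode(example_texts, normalize_embeddings=True)

def build_few_shot_prompt(new_text: str, k: int = 3) -> str:
    # Embed the new text and score it against every stored example
    query_embedding = embedder.encode([new_text], normalize_embeddings=True)[0]
    similarities = example_embeddings @ query_embedding
    top_indices = np.argsort(similarities)[::-1][:k]

    # Insert only the most similar examples into the prompt
    examples = "\n".join(
        f"{example_texts[i]} -> {example_labels[i]}" for i in top_indices
    )
    return f"{examples}\nClassify the text: {new_text}"

The resulting prompt then contains only the examples most relevant to the text at hand, instead of the same static list every time.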
RAG
Retrieval augmented generation is a well-known technique for increasing the knowledge of LLMs. Assume you already have a database consisting of thousands of documents. You now receive a question from a user, and have to answer it, given the knowledge inside your database.
Unfortunately, you can’t feed the entire database into the LLM. Even though we have LLMs such as Llama 4 Scout with a 10-million-token context window, databases are usually much larger. You therefore have to find the most relevant information in the database to feed into your LLM. RAG does this similarly to dynamic few-shot prompting:
- Perform a vector search
- Find the most similar documents to the user question (most similar documents are assumed to be most relevant)
- Ask the LLM to answer the question, given the most similar documents
By performing RAG, you are doing context engineering: you only provide the LLM with the most relevant data for performing its task. To improve the LLM’s performance further, you can work on the context engineering by refining the retrieval step, for example so that it surfaces only the most relevant documents.
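As a rough sketch, the three steps above could look like this, assuming an in-memory Chroma vector store and the OpenAI Python SDK (document contents and the model name are placeholders):

import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()  # in-memory vector store for illustration
collection = chroma.create_collection("documents")

# Index your documents (placeholder content)
collection.add(
    ids=["doc1", "doc2"],
    documents=["Our refund policy lasts 30 days.", "Shipping takes 3-5 business days."],
)

def answer_question(question: str) -> str:
    # Steps 1 and 2: vector search for the documents most similar to the question
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # Step 3: ask the LLM to answer, given only the retrieved documents
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer the question using only the provided documents."},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content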
You can read more about RAG in my article about developing a RAG system for your personal data.
Tools (MCP)
You can also provide the LLM with tools to call, which is an important part of context engineering, especially now that we see the rise of AI agents. Tool calling today is often done using the Model Context Protocol (MCP), an open protocol introduced by Anthropic.
AI agents are LLMs capable of calling tools and thus performing actions. An example of this could be a weather agent. If you ask an LLM without access to tools about the weather in New York, it will not be able to provide an accurate response. The reason for this is naturally that information about the weather needs to be fetched in real time. To do this, you can, for example, give the LLM a tool such as:
from langchain_core.tools import tool  # assuming LangChain's @tool decorator

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a given city."""
    # Placeholder: in a real tool, call a weather API here
    weather = f"The weather in {city} is sunny and 25°C"
    return weather
If you give the LLM access to this tool and ask it about the weather, it can then search for the weather for a city and provide you with an accurate response.
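To make this concrete, here is a minimal sketch of how the tool could be attached to a model, assuming LangChain's ChatOpenAI wrapper and reusing the get_weather tool defined above (the model name and question are just placeholders):

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model name
llm_with_tools = llm.bind_tools([get_weather])

# The model can now decide to call the tool when the question requires live data
response = llm_with_tools.invoke("What is the weather in New York right now?")
print(response.tool_calls)  # e.g. [{"name": "get_weather", "args": {"city": "New York"}, ...}]

Your code would then execute the requested tool call, pass the result back to the model, and let it formulate the final answer.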
Providing tools for LLMs is incredibly important, as it significantly enhances the abilities of the LLM. Other examples of tools are:
- Search the internet
- A calculator
- Search via Twitter API
Topics to consider
In this section, I make a few notes on what you should consider when creating the context to feed into your LLM.
Utilization of context length
The context length of an LLM is an important consideration. As of July 2025, most frontier LLMs accept over 100,000 input tokens. This gives you a lot of options for how to utilize the context, but you have to consider the tradeoff between:
- Including a lot of information in a prompt, thus risking some of the information getting lost in the context
- Missing some important information in the prompt, thus risking the LLM not having the required context to perform a specific task
Usually, the only way to figure out the right balance is to test your LLM’s performance. For example, with a classification task, you can check the accuracy you achieve with different prompts.
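A minimal sketch of such a check, assuming a small labeled test set and a classify function like the zero-shot one sketched earlier:

# Hypothetical labeled test set
test_set = [
    ("I loved this movie, it was fantastic!", "Class A"),
    ("The food was cold and the staff was rude.", "Class B"),
]

def accuracy(classify_fn) -> float:
    # Compare the LLM's prediction against the known label for each test example
    correct = sum(1 for text, label in test_set if label in classify_fn(text))
    return correct / len(test_set)

print(accuracy(classify_zero_shot))  # repeat for each prompt variant you want to compare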
If I discover the context is too long for the LLM to work effectively, I sometimes split a task into several prompts. For example, one prompt can summarize a text, and a second prompt can classify the summary. This can help the LLM utilize its context effectively and thus increase performance.
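As a rough sketch, such a two-step chain could look like this, reusing the client and classifier from the earlier snippets (prompt wording and model name are placeholders):

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"Summarize the following text in a few sentences:\n{text}"}],
    )
    return response.choices[0].message.content

def classify_long_text(text: str) -> str:
    # Step 1: compress the long text so the classifier sees a shorter, focused context
    summary = summarize(text)
    # Step 2: classify the summary instead of the full text
    return classify_zero_shot(summary)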
Furthermore, providing too much context to the model can have a significant downside, as I describe in the next section:
Context rot
Last week, I read an interesting article about context rot. The article was about the fact that increasing the context length lowers LLM performance, even though the task difficulty doesn’t increase. This implies that:
Providing an LLM with irrelevant information will decrease its ability to perform tasks successfully, even if task difficulty does not increase
The point here is essentially that you should only provide relevant information to your LLM. Providing other information decreases LLM performance (i.e., performance is not neutral to input length).
Conclusion
In this article, I have discussed the topic of context engineering, which is the process of providing an LLM with the right context to perform its task effectively. There are a lot of techniques you can utilize to fill up the context, such as few-shot prompting, RAG, and tools. These are all powerful techniques you can use to significantly improve an LLM’s ability to perform a task effectively. Furthermore, you also have to consider the fact that providing an LLM with too much context also has downsides. Increasing the number of input tokens reduces performance, as you could read about in the article about context rot.