Context engineering is one of the most relevant topics in machine learning today, which is why I’m writing my third article on it. My goal is both to broaden my own understanding of context engineering for LLMs and to share that knowledge through my articles.
In today’s article, I’ll discuss improving the context you feed into your LLMs for question answering. Usually, this context is built with retrieval augmented generation (RAG); however, in today’s ever-shifting environment, that approach deserves an update.
You can also read my previous context engineering articles:
Why you should care about context engineering
First, let me highlight three key points for why you should care about context engineering:
- Better output quality: avoiding context rot by cutting unnecessary tokens improves responses. You can read more details about context rot in this article
- Lower cost: unnecessary tokens cost money to send
- Higher speed: fewer tokens mean faster response times
These are three core metrics for most question answering systems. The output quality is naturally of utmost priority, considering users will not want to use a low-performing system.
Furthermore, price should always be a consideration, and if you can lower it (without too much engineering cost), doing so is a simple decision. Lastly, a faster question answering system provides a better user experience; you don’t want users waiting several seconds for a response when ChatGPT would answer much faster.
The traditional question-answering approach
Traditional, in this sense, means the most common question answering approach in systems built after the release of ChatGPT. That approach is traditional RAG, which works as follows:
- Fetch the most relevant documents to the user’s question, using vector similarity retrieval
- Feed relevant documents along with a question into an LLM, and receive a response
Considering its simplicity, this approach works incredibly well. Interestingly, we see the same pattern with another traditional technique: BM25 has been around since 1994 and was recently utilized by Anthropic when they introduced Contextual Retrieval, proving how effective even simple information retrieval techniques can be.
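To make those two steps concrete, here is a minimal sketch of the traditional pipeline, assuming sentence-transformers for embeddings and the OpenAI client for generation; the model names and toy documents are placeholders, not recommendations.

```python
# Minimal sketch of the traditional RAG pipeline described above.
# Assumes sentence-transformers for retrieval and the OpenAI client for
# generation; model names and the `documents` list are illustrative.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

documents = [
    "Doc 1: our refund policy allows returns within 30 days.",
    "Doc 2: shipping usually takes 3-5 business days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def answer(question: str, top_k: int = 2) -> str:
    # Step 1: fetch the most relevant documents via vector similarity.
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top_docs = [documents[int(i)] for i in scores.argsort(descending=True)[:top_k]]

    # Step 2: feed the relevant documents plus the question into an LLM.
    context = "\n\n".join(top_docs)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("How long do I have to return an item?"))
```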
However, you can still vastly improve your question answering system by updating your RAG using some techniques I’ll describe in the next section.
Improving RAG context fetching
Even though RAG works relatively well, you can likely achieve better performance by introducing the techniques I’ll discuss in this section. The techniques I describe here all focus on improving the context you feed to the LLM. You can improve this context with two main approaches:
- Use fewer tokens on irrelevant context (for example, by removing irrelevant documents or using less material from the less relevant ones)
- Add documents that are relevant
Thus, you should focus on achieving at least one of the points above. If you think in terms of precision and recall:
- Using fewer tokens on irrelevant context increases precision (at the cost of recall)
- Adding more documents increases recall (at the cost of precision)
This is a tradeoff you must make while working on context engineering your question answering system.
Reducing the number of irrelevant tokens
In this section, I highlight three main approaches to reduce the number of irrelevant tokens you feed into the LLM’s context:
- Reranking
- Summarization
- Prompting GPT
When you fetch documents with vector similarity search, they are returned in order from most to least relevant according to the similarity score. However, this score might not accurately reflect which documents are actually most relevant to the question.
Reranking
You can thus use a reranking model, for example, Qwen reranker, to reorder the document chunks. You can then choose to only keep the top X most relevant chunks (according to the reranker), which should remove some irrelevant documents from your context.
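Here is a minimal sketch of this step using a cross-encoder style reranker from sentence-transformers; the model name is a placeholder, and a Qwen reranker (or any model that scores query-chunk pairs) could be swapped in.

```python
# Sketch of reranking retrieved chunks with a cross-encoder style reranker.
# The model name is a placeholder; any reranker that scores
# (query, chunk) pairs could be used instead.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question: str, chunks: list[str], keep_top: int = 5) -> list[str]:
    # Score every (question, chunk) pair; higher score means more relevant.
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    # Keep only the top X chunks and drop the rest from the context.
    return [chunk for chunk, _ in ranked[:keep_top]]
```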
Summarization
You can also choose to summarize documents, reducing the number of tokens used per document. You can, for example, keep the full document from the top 10 most similar documents fetched, summarize documents ranked from 11-20, and discard the rest.
This approach will increase the likelihood that you keep the full context from relevant documents, while at least maintaining some context (the summary) from documents that are less likely to be relevant.
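A sketch of this tiered approach follows, assuming the OpenAI client; the model name, prompt wording, and rank cutoffs are illustrative.

```python
# Sketch of the tiered approach: keep the top 10 documents in full,
# summarize documents ranked 11-20 with an LLM call, and discard the rest.
# The model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()

def summarize(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize the parts of this document relevant to "
                       f"the question '{question}':\n\n{document}",
        }],
    )
    return response.choices[0].message.content

def build_context(ranked_docs: list[str], question: str) -> list[str]:
    full_docs = ranked_docs[:10]                                      # keep in full
    summaries = [summarize(d, question) for d in ranked_docs[10:20]]  # compress
    return full_docs + summaries                                      # ranks 21+ are discarded
```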
Prompting GPT
Lastly, you can also prompt GPT to judge whether each fetched document is relevant to the user query. For example, if you fetch 15 documents, you can make 15 individual LLM calls, one per document, and discard the documents that are deemed irrelevant. Keep in mind that these LLM calls need to be parallelized to keep response time within an acceptable limit.
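A sketch of this LLM-as-a-judge filtering, with the per-document calls parallelized through a thread pool; the model name and prompt are assumptions.

```python
# Sketch of LLM-as-a-judge relevance filtering, with one call per document
# run in parallel via a thread pool. Model name and prompt are assumptions.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def is_relevant(question: str, document: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nDocument:\n{document}\n\n"
                       "Is this document relevant to the question? Answer YES or NO.",
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def filter_documents(question: str, documents: list[str]) -> list[str]:
    # Parallelize the judge calls so 15 documents don't take 15x the latency.
    with ThreadPoolExecutor(max_workers=15) as pool:
        verdicts = list(pool.map(lambda doc: is_relevant(question, doc), documents))
    return [doc for doc, keep in zip(documents, verdicts) if keep]
```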
Adding relevant documents
Before or after removing irrelevant documents, you should also ensure you include the relevant ones. I cover two main approaches in this subsection:
- Better embedding models
- Searching through more documents (at the cost of lower precision)
Better embedding models
To find the best embedding models, you can go to the HuggingFace embedding model leaderboard, where Gemini and Qwen models are in the top 3 as of writing. Updating your embedding model is usually a cheap way to fetch more relevant documents, because creating and storing embeddings is typically inexpensive, for example, embedding through the Gemini API and storing the vectors in Pinecone.
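To illustrate how small this change usually is, here is a sketch using sentence-transformers; the model identifiers are placeholders, and an API-based model (for example through Gemini) would follow the same pattern of re-embedding the corpus once and then using the new model for all queries.

```python
# Sketch of swapping in a stronger embedding model: the only change is the
# model identifier, plus a one-off re-embedding of the corpus. The model
# names are placeholders; an API-based or leaderboard model could be used.
from sentence_transformers import SentenceTransformer

OLD_MODEL = "all-MiniLM-L6-v2"        # current (weaker) model
NEW_MODEL = "BAAI/bge-large-en-v1.5"  # example of a stronger model

def reembed_corpus(documents: list[str]):
    embedder = SentenceTransformer(NEW_MODEL)
    # Re-embed every document once; queries must use the same model afterwards.
    return embedder.encode(documents, normalize_embeddings=True)
```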
Search more documents
Another (relatively simple) way to include more relevant documents is simply to fetch more documents in general. Fetching more documents naturally increases the probability that you include the relevant ones. However, you have to balance this against context rot and the goal of keeping irrelevant documents to a minimum. Every unnecessary token in an LLM call is, as mentioned earlier, likely to:
- Reduce output quality
- Increase cost
- Lower speed
These are all crucial aspects of a question-answering system.
Agentic search approach
I’ve discussed agentic search approaches in previous articles, for example, when I discussed Scaling your AI Search. However, in this section, I’ll dive deeper into setting up an agentic search, which replaces some or all of the vector retrieval step in your RAG.
The first step is that the user asks a question about a given set of data points, for example, a set of documents. You then set up an agentic system consisting of an orchestrator agent and a list of sub-agents.
This is an example of the pipeline the agents would follow (though there are many ways to set it up).
- The orchestrator agent tells two sub-agents to iterate over all document filenames and return the relevant documents
- The relevant documents are fed back to the orchestrator agent, which then dispatches a sub-agent to each relevant document to fetch the subparts (chunks) of that document that are relevant to the user’s question. These chunks are then fed back to the orchestrator agent
- The orchestrator agent answers the user’s question, given the provided chunks
Another flow you could implement is to store document embeddings and replace step one with vector similarity between the user’s question and each document.
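Below is a minimal sketch of the first flow, assuming the OpenAI client; the model name, prompts, and helper functions are illustrative assumptions rather than a fixed recipe.

```python
# Sketch of the orchestrator + sub-agent flow described above.
# Model name, prompts, and helper names are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def select_files(question: str, filenames: list[str]) -> list[str]:
    # Step 1: a sub-agent scans the filename list and picks relevant documents.
    reply = ask(
        f"Question: {question}\n\nFilenames:\n" + "\n".join(filenames)
        + "\n\nReturn the filenames relevant to the question, one per line."
    )
    return [line.strip() for line in reply.splitlines() if line.strip() in filenames]

def extract_chunks(question: str, filename: str, documents: dict[str, str]) -> str:
    # Step 2: one sub-agent per relevant document pulls out the useful passages.
    return ask(
        f"Question: {question}\n\nDocument '{filename}':\n{documents[filename]}\n\n"
        "Quote only the passages that help answer the question."
    )

def agentic_answer(question: str, documents: dict[str, str]) -> str:
    relevant_files = select_files(question, list(documents))
    # Run the per-document sub-agents in parallel to keep latency acceptable.
    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(
            lambda f: extract_chunks(question, f, documents), relevant_files
        ))
    # Step 3: the orchestrator answers using only the extracted chunks.
    return ask(
        f"Question: {question}\n\nExtracted context:\n" + "\n\n".join(chunks)
        + "\n\nAnswer the question using only this context."
    )
```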
This agentic approach has upsides and downsides.
Upsides:
- Better chance of fetching relevant chunks than with traditional RAG
- More control over the RAG system. You can update system prompts and so on, while traditional RAG is relatively static with its embedding similarities
Downsides:
- Higher cost, since the agentic flow requires many more LLM calls than a single vector retrieval step
In my opinion, building such an agent-based retrieval system is a super powerful approach that can lead to amazing results. The consideration you have to make when building such a system is whether the increased quality you’ll (likely) see is worth the increase in cost.
Other context engineering aspects
In this article, I’ve mainly covered context engineering for the documents we fetch in a question answering system. However, there are also other aspects you should be aware of, mainly:
- The system/user prompt you are using
- Other information fed into the prompt
The prompt you write for your question answering system should be precise, structured, and avoid irrelevant information. You can read many other articles on the topic of structuring prompts, and you can typically ask an LLM to improve these aspects of your prompt.
Sometimes, you also feed other information into your prompt. A common example is metadata about the user, such as:
- Name
- Job role
- What they usually search for
- etc
Whenever you add such information, you should always ask yourself:
Does adding this information help my question answering system answer the question?
Sometimes the answer is yes, other times it’s no. The most important part is that you make a deliberate decision on whether the information is needed in the prompt. If you can’t justify having this information in the prompt, it should usually be removed.
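As a small illustration, here is a hypothetical prompt builder that only adds a piece of user metadata (the job role) when it can be justified; the field names are made up for this example.

```python
# Hypothetical prompt builder: user metadata is amended only when it is
# expected to help answer the question. Field names are illustrative.
def build_prompt(question: str, context: str, user_role: str | None = None) -> str:
    prompt = (
        "Answer the question using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Only include metadata that demonstrably helps, e.g. tailoring the answer
    # to the user's job role; name or search history would be left out here.
    if user_role:
        prompt += f"\n\nThe user is a {user_role}; tailor the level of detail accordingly."
    return prompt
```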
Conclusion
In this article, I have discussed context engineering for your question answering system and why it’s important. Question answering systems usually consist of an initial step to fetch relevant information. When assembling this information, the focus should be on reducing irrelevant tokens to a minimum while including as many relevant pieces of information as possible.
You can also read my in-depth article on Anthropic’s contextual retrieval below: