AI search has become prevalent since the introduction of LLMs in 2022. Retrieval-augmented generation (RAG) systems quickly adapted to using these efficient LLMs for better question answering. AI search is extremely powerful because it provides the user with rapid access to large amounts of information. You see AI search systems, for example, in:
- ChatGPT
- Legal AI, such as Harvey
- Google Search, whenever Gemini responds to your query
Essentially, wherever you have an AI search, RAG is usually the backbone. However, searching with AI is much more than simply using RAG.
In this article, I’ll discuss how to perform search with AI, and how you can scale your system, both in terms of quality and scalability.
You can also learn about how to improve your RAG by 50% with Contextual Retrieval, or you can read about ensuring reliability in LLM applications.
Motivation
My motivation for writing this article is that searching with AI has quickly become a standard part of our day-to-day. You see AI searches everywhere, for example, when you Google something and Gemini provides you with an answer. Utilizing AI this way is extremely time-efficient, since I, as the person querying, do not have to open any links; I simply get a summarized answer right in front of me.
Thus, if you’re building an application, it’s important to know how to build such a system and to understand its inner workings.
Building your AI search system
There are several vital aspects to consider when building your search system. In this section, I’ll cover the most important aspects.
RAG

First, you need to build the basics. The core component of any AI search is usually a RAG system. The reason for this is that RAG is an extremely efficient way of accessing data, and it’s relatively simple to set up. Essentially, you can make a pretty good AI search with very little effort, which is why I always recommend starting off with implementing RAG.
You can utilize end-to-end RAG providers such as Elysia; however, if you want more flexibility, creating your own RAG pipeline is often a good option. Essentially, RAG consists of the following core steps (a minimal code sketch follows the list):
- Split your data into chunks of a set size (for example, 500 tokens) and embed each chunk, so we can perform embedding similarity calculations on it.
- When a user enters a query, we embed the query (with the same embedding engine as used in step 1) and find the most similar chunks using vector similarity.
- Lastly, we feed these chunks, along with the user question, into an LLM such as GPT-4o, which provides us with an answer.
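To make this concrete, here is a minimal sketch of such a pipeline. It assumes the OpenAI Python SDK; the chunking strategy, model names, and prompt are illustrative, and the in-memory NumPy matrix stands in for a proper vector database.

```python
# Minimal RAG sketch using the OpenAI Python SDK (model names are illustrative).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking by words; roughly matches "500 tokens per chunk".
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1) Chunk and embed all of your data ahead of time.
documents = ["<your document text here>"]
chunks = [c for doc in documents for c in chunk(doc)]
chunk_vectors = embed(chunks)

def answer(query: str, top_k: int = 3) -> str:
    # 2) Embed the query and find the most similar chunks via cosine similarity.
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])

    # 3) Feed the retrieved chunks and the user question into an LLM.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

In production you would typically store the chunk embeddings in a vector database rather than an in-memory array, but the three steps stay the same.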
And that’s it. If you implement this, you’ve already made an AI search that will perform relatively well in most scenarios. However, if you really want to make a good search, you need to incorporate more advanced RAG techniques, which I will cover later in this article.
Scalability
Scalability is an important aspect of building your search system. I’ve divided the scalability aspect into two main areas:
- Response time (how long the user has to wait for an answer) should be as low as possible.
- Uptime (the percentage of time your platform is up and running) should be as high as possible.
Response time
You have to ensure you reply quickly to user queries. With a standard RAG system, this is usually not an issue, considering:
- Your dataset is embedded beforehand (takes no time during a user query).
- Embedding the user query is nearly instant.
- Performing vector similarity search is also near instant (because the computation can be parallelized).
Thus, the LLM response time is usually the deciding factor in how fast your RAG performs. To minimize this time, you should consider the following:
- Use an LLM with a fast response time.
- GPT-4o and GPT-4.1 were a bit slower, but OpenAI has massively improved speed with GPT-5.
- The Gemini 2.0 Flash models have always been very fast (their response times are ludicrously low).
- Mistral also provides a fast LLM service.
- Implement streaming, so you don’t have to wait for all the output tokens to be generated before showing a response.
The last point on streaming is very important. As a user, I hate waiting for an application without receiving any feedback on what’s happening. For example, imagine waiting for the Cursor agent to perform a large number of changes without seeing anything on screen before it’s done.
That’s why streaming, or at least providing the user with some feedback while waiting, is incredibly important. I summarized this in a quote below.
It’s usually not about the response time as a number, but rather the user’s perceived response time. If you fill the user’s wait time with feedback, they will perceive the response time as faster.
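As a concrete example, here is a minimal streaming sketch using the OpenAI Python SDK; the model and prompt are illustrative, and other providers expose similar streaming interfaces.

```python
# Minimal streaming sketch: tokens are shown to the user as they arrive.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the retrieved context ..."}],
    stream=True,
)

for event in stream:
    delta = event.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render each token immediately instead of waiting
```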
It’s also important to consider that when you expand and improve your AI search, you will typically add more components. These components will inevitably take more time. However, you should always look for parallelized operations. The biggest threat to your response time is sequential operations, and they should be reduced to an absolute minimum.
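As a sketch of what this looks like in practice, the example below runs two independent (stand-in) retrieval steps concurrently with asyncio instead of one after the other:

```python
# Sketch: run independent retrieval steps concurrently instead of sequentially.
import asyncio

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)          # stand-in for an embedding + vector lookup
    return ["chunk from vector search"]

async def keyword_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)          # stand-in for a BM25 lookup
    return ["chunk from keyword search"]

async def retrieve(query: str) -> list[str]:
    # The two lookups are independent, so run them concurrently.
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(query), keyword_search(query)
    )
    return vector_hits + keyword_hits

print(asyncio.run(retrieve("example query")))  # ~0.2s total instead of ~0.4s sequentially
```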
Uptime
Uptime is also important when hosting an AI search. You essentially have to have a service up and running at all times, which can be difficult when dealing with unpredictable LLMs. I wrote an article about ensuring reliability in LLM applications, linked above, if you want to learn more about how to make your application robust.
These are the most important aspects to consider to ensure a high uptime for your search service:
- Have error handling for everything that deals with LLMs. When you’re making millions of LLM calls, things will go wrong. It could be
- OpenAI content filtering
- Token limits (which are notoriously difficult to get increased with some providers)
- The LLM service is slow, or its servers are down
- …
- Have backups. Wherever you have an LLM call, you should have one or two backup providers ready to step in when something goes wrong (see the sketch after this list).
- Run proper tests before deployments.
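Below is a minimal sketch of what such error handling with a backup provider can look like; the two provider calls are placeholders you would swap for your actual LLM clients.

```python
# Sketch: wrap every LLM call with retries and a backup provider (functions are placeholders).
import time

def call_primary(prompt: str) -> str:
    raise RuntimeError("provider outage")   # placeholder for e.g. an OpenAI call

def call_backup(prompt: str) -> str:
    return "answer from backup provider"    # placeholder for e.g. a Gemini or Mistral call

def robust_llm_call(prompt: str, retries: int = 2) -> str:
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except Exception:
            time.sleep(2 ** attempt)         # simple exponential backoff between retries
    return call_backup(prompt)               # fall back when the primary keeps failing

print(robust_llm_call("Hello"))
```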
Evaluation
When you are building an AI search system, evaluations should be one of your top priorities. There’s no point in continuing to build features if you can’t test your search and figure out where you’re thriving and where you’re struggling. I’ve written two articles on this topic: How to Develop Powerful Internal LLM Benchmarks and How to Use LLMs for Powerful Automatic Evaluations.
In summary, I recommend doing the following to evaluate your AI search and maintain high quality:
- Integrate with a prompt engineering platform to version your prompts, test new prompts before they are released, and run large-scale experiments.
- Do regular analyses of the last month’s user queries. Annotate which ones succeeded and which ones failed, along with the reason why.
I would then group the queries that went wrong by their reason. For example:
- User intent was unclear
- Issues with the LLM provider
- The fetched context did not contain the necessary information to answer the query.
- …
And then begin working on the most pressing issues that are causing the most unsuccessful user queries.
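A simple sketch of this grouping step, with hypothetical annotations, could look like this:

```python
# Sketch: tally annotated failure reasons to see which issues to tackle first.
from collections import Counter

# Hypothetical annotations from last month's failed queries.
annotations = [
    {"query": "q1", "failure_reason": "unclear user intent"},
    {"query": "q2", "failure_reason": "context missing information"},
    {"query": "q3", "failure_reason": "context missing information"},
]

for reason, count in Counter(a["failure_reason"] for a in annotations).most_common():
    print(f"{count:>3}  {reason}")
```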
Techniques to improve your AI search
There are a plethora of techniques you can utilize to improve your AI search. In this section, I cover a few of them.
Contextual Retrieval
This technique was first introduced by Anthropic in 2024. I also wrote an extensive article on contextual retrieval if you want more details.
The figure below highlights the pipeline for contextual retrieval. You keep the vector database from your RAG system, but you also add a BM25 index (a keyword search) over your chunks, and each chunk is prepended with a short, LLM-generated description of how it fits into its document before it is embedded and indexed. This works well because users sometimes query with specific keywords, and BM25 handles such keyword searches better than vector similarity search does.

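A rough sketch of the indexing side could look like the following; it assumes the OpenAI SDK and the third-party rank_bm25 package, and the model name and prompt are illustrative.

```python
# Sketch of contextual retrieval: prepend LLM-generated context to each chunk,
# then index the contextualized chunks in a BM25 index (the embedding index is
# built over the same contextualized chunks, as in the earlier RAG sketch).
from openai import OpenAI
from rank_bm25 import BM25Okapi

client = OpenAI()

def contextualize(chunk: str, full_document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here is a document:\n" + full_document +
                "\n\nWrite one short sentence situating this chunk within the document:\n" + chunk
            ),
        }],
    )
    return resp.choices[0].message.content + " " + chunk

document = "<full document text>"
chunks = ["<chunk 1>", "<chunk 2>"]
contextual_chunks = [contextualize(c, document) for c in chunks]

# BM25 keyword index over the contextualized chunks.
bm25 = BM25Okapi([c.lower().split() for c in contextual_chunks])
keyword_scores = bm25.get_scores("example query terms".lower().split())
```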
BM25 outside RAG
Another option is quite similar to contextual retrieval; however, in this instance, you perform BM25 outside of the RAG pipeline (in contextual retrieval, BM25 is used to fetch the most relevant documents for RAG). This can also be a powerful technique, considering users sometimes use your AI search as a basic keyword search.
However, when implementing this, I recommend developing a router agent that detects whether to use RAG or BM25 directly to answer the user query. If you want to learn more about creating AI router agents, or about building effective agents in general, Anthropic has written an extensive article on the topic.
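Here is a minimal sketch of such a router, assuming the OpenAI SDK; the routing prompt and model name are illustrative, and the two answer functions are hypothetical stand-ins for the pipelines described earlier.

```python
# Sketch of a router agent: a small LLM call decides whether a query is a keyword
# lookup (route to BM25 directly) or a natural-language question (route to RAG).
from openai import OpenAI

client = OpenAI()

def bm25_answer(query: str) -> str:
    return "result from direct keyword search"   # placeholder for a BM25 lookup

def rag_answer(query: str) -> str:
    return "answer from the RAG pipeline"        # placeholder for the RAG pipeline

def route(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify the query as 'keyword' (short keyword lookup) or 'semantic' "
                f"(natural-language question). Reply with one word.\n\nQuery: {query}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def search(query: str) -> str:
    if route(query) == "keyword":
        return bm25_answer(query)
    return rag_answer(query)
```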
Agents
Agents are the latest hype within the LLM space. However, they are not simply hype; they can also effectively improve your AI search. You can, for example, create subagents that find relevant material: similar to fetching relevant documents with RAG, but with an agent reading through entire documents itself instead of retrieving chunks. This is partly how deep research tools from OpenAI, Gemini, and Anthropic work, and it is an extremely effective (though expensive) way of performing AI search. You can read more about how Anthropic built its deep research using agents here.
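As a rough sketch of this subagent pattern (assuming the OpenAI SDK; documents, model names, and prompts are illustrative), each subagent reads one full document and a final call synthesizes the findings:

```python
# Sketch of a subagent pattern: each subagent extracts what is relevant to the
# query from one full document; a lead call then synthesizes the findings.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def subagent(query: str, document: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Extract everything relevant to '{query}' from:\n{document}"}],
    )
    return resp.choices[0].message.content

async def deep_search(query: str, documents: list[str]) -> str:
    # Subagents run in parallel, one per document.
    findings = await asyncio.gather(*(subagent(query, d) for d in documents))
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Answer '{query}' using these findings:\n" + "\n\n".join(findings)}],
    )
    return resp.choices[0].message.content
```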
Conclusion
In this article, I have covered how you can build and improve your AI search capabilities. I first elaborated on why knowing how to build such applications is important and why you should focus on it. Furthermore, I highlighted how you can develop an effective AI search with basic RAG, and then improve on it using techniques such as contextual retrieval.
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium