AI search has become prevalent since the introduction of LLMs in 2022. Retrieval-augmented generation (RAG) systems quickly adapted to using these efficient LLMs for better question answering. AI search is extremely powerful because it provides the user with rapid access to large amounts of information. You see AI search systems, for example, in:
- ChatGPT
- Legal AI, such as Harvey
- Google Search, whenever Gemini responds to your query
Essentially, wherever you have an AI search, RAG is usually the backbone. However, searching with AI is much more than simply using RAG.
In this article, I’ll discuss how to perform search with AI, and how you can scale your system, both in terms of quality and scalability.
You can also learn about how to improve your RAG by 50% with Contextual Retrieval, or you can read about ensuring reliability in LLM applications.
Motivation
My motivation for writing this article is that searching with AI has quickly become a standard part of our day-to-day. You see AI searches everywhere, for example, when you Google something and Gemini provides you with an answer. Utilizing AI this way is extremely time-efficient, since I, as the person querying, do not have to open any links; I simply get a summarized answer right in front of me.
Thus, if you’re building an application, it’s important to know how to build such a system and to understand its inner workings.
Building your AI search system
There are several vital aspects to consider when building your search system. In this section, I’ll cover the most important aspects.
RAG

First, you need to build the basics. The core component of any AI search is usually a RAG system. The reason for this is that RAG is an extremely efficient way of accessing data, and it’s relatively simple to set up. Essentially, you can make a pretty good AI search with very little effort, which is why I always recommend starting off with implementing RAG.
You can utilize end-to-end RAG providers such as Elysia; however, if you want more flexibility, creating your own RAG pipeline is often a good option. Essentially, RAG consists of the following core steps (a minimal code sketch follows the list):
- Split your data into chunks of a set size (for example, 500 tokens) and embed each chunk, so we can perform embedding similarity calculations on it.
- When a user enters a query, we embed the query (with the same embedding engine as used in step 1) and find the most similar chunks using vector similarity.
- Lastly, we feed these chunks, along with the user question, into an LLM such as GPT-4o, which provides us with an answer.
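To make this concrete, here is a minimal sketch of such a pipeline. It assumes the OpenAI Python SDK; the chunking strategy, model names, and prompt are illustrative, and the in-memory NumPy matrix stands in for a proper vector database.

```python
# Minimal RAG sketch using the OpenAI Python SDK (model names are illustrative).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking by words; roughly matches "500 tokens per chunk".
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1) Chunk and embed all of your data ahead of time.
documents = ["<your document text here>"]
chunks = [c for doc in documents for c in chunk(doc)]
chunk_vectors = embed(chunks)

def answer(query: str, top_k: int = 3) -> str:
    # 2) Embed the query and find the most similar chunks via cosine similarity.
    q = embed([query])[0]
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(scores)[::-1][:top_k])

    # 3) Feed the retrieved chunks and the user question into an LLM.
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

In production you would typically store the chunk embeddings in a vector database rather than an in-memory array, but the three steps stay the same.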
And that’s it. If you implement this, you’ve already made an AI search that will perform relatively well in most scenarios. However, if you really want to make a good search, you need to incorporate more advanced RAG techniques, which I will cover later in this article.
Scalability
Scalability is an important aspect of building your search system. I’ve divided the scalability aspect into two main areas:
- Response time (how long the user has to wait for an answer) should be as low as possible.
- Uptime (the percentage of time your platform is up and running) should be as high as possible.
Response time
You have to ensure you reply quickly to user queries. With a standard RAG system, this is usually not an issue, considering:
- Your dataset is embedded beforehand (takes no time during a user query).
- Embedding the user query is nearly instant.
- Performing vector similarity search is also near instant (because the computation can be parallelized).
Thus, the LLM response time is usually the deciding factor in how fast your RAG performs. To minimize this time, you should consider the following:
- Use an LLM with a fast response time.
- GPT-4o and GPT-4.1 were a bit slower, but OpenAI has massively improved speed with GPT-5.
- The Gemini 2.0 Flash models have always been very fast (their response times are ludicrously low).
- Mistral also provides a fast LLM service.
- Implement streaming, so you don’t have to wait for all the output tokens to be generated before showing a response.
The last point on streaming is very important. As a user, I hate waiting for an application without receiving any feedback on what’s happening. For example, imagine waiting for the Cursor agent to perform a large number of changes without seeing anything on screen before it’s done.
That’s why streaming, or at least providing the user with some feedback while waiting, is incredibly important. I summarized this in a quote below.
It’s usually not about the response time as a number, but rather the user’s perceived response time. If you fill the user’s wait time with feedback, they will perceive the response time as faster.
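As a concrete example, here is a minimal streaming sketch using the OpenAI Python SDK; the model and prompt are illustrative, and other providers expose similar streaming interfaces.

```python
# Minimal streaming sketch: tokens are shown to the user as they arrive.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the retrieved context ..."}],
    stream=True,
)

for event in stream:
    delta = event.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # render each token immediately instead of waiting
```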
It’s also important to consider that when you expand and improve your AI search, you will typically add more components. These components will inevitably take more time. However, you should always look for parallelized operations. The biggest threat to your response time is sequential operations, and they should be reduced to an absolute minimum.
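As a sketch of what this looks like in practice, the example below runs two independent (stand-in) retrieval steps concurrently with asyncio instead of one after the other:

```python
# Sketch: run independent retrieval steps concurrently instead of sequentially.
import asyncio

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)          # stand-in for an embedding + vector lookup
    return ["chunk from vector search"]

async def keyword_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)          # stand-in for a BM25 lookup
    return ["chunk from keyword search"]

async def retrieve(query: str) -> list[str]:
    # The two lookups are independent, so run them concurrently.
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(query), keyword_search(query)
    )
    return vector_hits + keyword_hits

print(asyncio.run(retrieve("example query")))  # ~0.2s total instead of ~0.4s sequentially
```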
Uptime
Uptime is also important when hosting an AI search. You essentially have to have a service up and running at all times, which can be difficult when dealing with unpredictable LLMs. I wrote an article about ensuring reliability in LLM applications, linked above, if you want to learn more about how to make your application robust.
These are the most important aspects to consider to ensure a high uptime for your search service:
- Have error handling for everything that deals with LLMs. When you’re making millions of LLM calls, things will go wrong. It could be
- OpenAI content filtering
- Token limits (which are notoriously difficult to get increased with some providers)
- The LLM service is slow, or its servers are down
- …
- Have backups. Wherever you have an LLM call, you should have one or two backup providers ready to step in when something goes wrong (see the sketch after this list).
- Run proper tests before deployments.
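Below is a minimal sketch of what such error handling with a backup provider can look like; the two provider calls are placeholders you would swap for your actual LLM clients.

```python
# Sketch: wrap every LLM call with retries and a backup provider (functions are placeholders).
import time

def call_primary(prompt: str) -> str:
    raise RuntimeError("provider outage")   # placeholder for e.g. an OpenAI call

def call_backup(prompt: str) -> str:
    return "answer from backup provider"    # placeholder for e.g. a Gemini or Mistral call

def robust_llm_call(prompt: str, retries: int = 2) -> str:
    for attempt in range(retries):
        try:
            return call_primary(prompt)
        except Exception:
            time.sleep(2 ** attempt)         # simple exponential backoff between retries
    return call_backup(prompt)               # fall back when the primary keeps failing

print(robust_llm_call("Hello"))
```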
Evaluation
When you are building an AI search system, evaluations should be one of your top priorities. There’s no point in continuing to build features if you can’t test your search and figure out where you’re thriving and where you’re struggling. I’ve written two articles on this topic: How to Develop Powerful Internal LLM Benchmarks and How to Use LLMs for Powerful Automatic Evaluations.
In summary, I recommend doing the following to evaluate your AI search and maintain high quality:
- Integrate with a prompt engineering platform to version your prompts, test new prompts before they are released, and run large-scale experiments.
- Do regular analyses of the last month’s user queries. Annotate which ones succeeded and which ones failed, along with the reason why.
I would then group the queries that went wrong by their reason. For example:
- User intent was unclear
- Issues with the LLM provider
- The fetched context did not contain the necessary information to answer the query.
- …
And then begin working on the most pressing issues that are causing the most unsuccessful user queries.
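A simple sketch of this grouping step, with hypothetical annotations, could look like this:

```python
# Sketch: tally annotated failure reasons to see which issues to tackle first.
from collections import Counter

# Hypothetical annotations from last month's failed queries.
annotations = [
    {"query": "q1", "failure_reason": "unclear user intent"},
    {"query": "q2", "failure_reason": "context missing information"},
    {"query": "q3", "failure_reason": "context missing information"},
]

for reason, count in Counter(a["failure_reason"] for a in annotations).most_common():
    print(f"{count:>3}  {reason}")
```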
Techniques to improve your AI search
There are a plethora of techniques you can utilize to improve your AI search. In this section, I cover a few of them.
Contextual Retrieval
This technique was first introduced by Anthropic in 2024. I also wrote an extensive article on contextual retrieval if you want more details.
The figure below highlights the pipeline for contextual retrieval. You keep the vector database from your RAG system, but you also add a BM25 index (a keyword search) over your chunks, and each chunk is prepended with a short, LLM-generated description of how it fits into its document before it is embedded and indexed. This works well because users sometimes query with specific keywords, and BM25 handles such keyword searches better than vector similarity search does.

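A rough sketch of the indexing side could look like the following; it assumes the OpenAI SDK and the third-party rank_bm25 package, and the model name and prompt are illustrative.

```python
# Sketch of contextual retrieval: prepend LLM-generated context to each chunk,
# then index the contextualized chunks in a BM25 index (the embedding index is
# built over the same contextualized chunks, as in the earlier RAG sketch).
from openai import OpenAI
from rank_bm25 import BM25Okapi

client = OpenAI()

def contextualize(chunk: str, full_document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Here is a document:\n" + full_document +
                "\n\nWrite one short sentence situating this chunk within the document:\n" + chunk
            ),
        }],
    )
    return resp.choices[0].message.content + " " + chunk

document = "<full document text>"
chunks = ["<chunk 1>", "<chunk 2>"]
contextual_chunks = [contextualize(c, document) for c in chunks]

# BM25 keyword index over the contextualized chunks.
bm25 = BM25Okapi([c.lower().split() for c in contextual_chunks])
keyword_scores = bm25.get_scores("example query terms".lower().split())
```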
BM25 outside RAG
Another option is quite similar to contextual retrieval; however, in this instance, you perform BM25 outside of the RAG pipeline (in contextual retrieval, BM25 is used to fetch the most relevant documents for RAG). This can also be a powerful technique, considering users sometimes use your AI search as a basic keyword search.
However, when implementing this, I recommend developing a router agent that detects whether to use RAG or BM25 directly to answer the user query. If you want to learn more about creating AI router agents, or about building effective agents in general, Anthropic has written an extensive article on the topic.
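Here is a minimal sketch of such a router, assuming the OpenAI SDK; the routing prompt and model name are illustrative, and the two answer functions are hypothetical stand-ins for the pipelines described earlier.

```python
# Sketch of a router agent: a small LLM call decides whether a query is a keyword
# lookup (route to BM25 directly) or a natural-language question (route to RAG).
from openai import OpenAI

client = OpenAI()

def bm25_answer(query: str) -> str:
    return "result from direct keyword search"   # placeholder for a BM25 lookup

def rag_answer(query: str) -> str:
    return "answer from the RAG pipeline"        # placeholder for the RAG pipeline

def route(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify the query as 'keyword' (short keyword lookup) or 'semantic' "
                f"(natural-language question). Reply with one word.\n\nQuery: {query}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def search(query: str) -> str:
    if route(query) == "keyword":
        return bm25_answer(query)
    return rag_answer(query)
```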
Agents
Agents are the latest hype within the LLM space. However, they are not simply hype; they can also effectively improve your AI search. You can, for example, create subagents that find relevant material: similar to fetching relevant documents with RAG, but with an agent reading through entire documents itself instead of retrieving chunks. This is partly how deep research tools from OpenAI, Gemini, and Anthropic work, and it is an extremely effective (though expensive) way of performing AI search. You can read more about how Anthropic built its deep research using agents here.
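As a rough sketch of this subagent pattern (assuming the OpenAI SDK; documents, model names, and prompts are illustrative), each subagent reads one full document and a final call synthesizes the findings:

```python
# Sketch of a subagent pattern: each subagent extracts what is relevant to the
# query from one full document; a lead call then synthesizes the findings.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def subagent(query: str, document: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Extract everything relevant to '{query}' from:\n{document}"}],
    )
    return resp.choices[0].message.content

async def deep_search(query: str, documents: list[str]) -> str:
    # Subagents run in parallel, one per document.
    findings = await asyncio.gather(*(subagent(query, d) for d in documents))
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Answer '{query}' using these findings:\n" + "\n\n".join(findings)}],
    )
    return resp.choices[0].message.content
```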
Conclusion
In this article, I have covered how you can build and improve your AI search capabilities. I first elaborated on why knowing how to build such applications is important and why you should focus on it. Furthermore, I highlighted how you can develop an effective AI search with basic RAG, and then improve on it using techniques such as contextual retrieval.
👉 Find me on socials:
🧑💻 Get in touch
✍️ Medium