
Why You Need RAG to Stay Relevant as a Data Scientist


Image by Author | Canva

 

If you work in a data-related field, you need to keep your skills up to date. Data scientists already juggle many tools for tasks like data visualization, data modeling, and even warehouse systems.

On top of that, AI has changed data science from A to Z. If you are searching for jobs related to data science, you have probably come across the term RAG.

In this article, we’ll break down RAG, starting with the academic paper that introduced it and moving on to how it’s now used to cut costs when working with large language models (LLMs). But first, let’s cover the basics.

 

What is Retrieval-Augmented Generation (RAG)?

 
 
Patrick Lewis and his colleagues first introduced RAG in an academic paper in 2020. It combines two key elements: a retriever and a generator.

The idea behind it is simple. Instead of generating answers purely from the model’s parameters, RAG first collects relevant information from your documents.

 

What is a retriever?

A retriever is used to collect relevant information from the document. But how?

Consider this: you have a massive Excel sheet, say 20 MB with thousands of rows, and you want to find the call_date for user_id = 10234.

Thanks to the retriever, instead of scanning the entire document, RAG searches only the relevant part.

 
 

Why is this helpful? If you send the entire document to the model, you will spend a lot of tokens, and as you probably know, LLM API usage is billed by the token.

Visit https://platform.openai.com/tokenizer to see how this calculation works. For instance, pasting the introduction of this article there costs 123 tokens.

This is how you estimate the cost of using an LLM’s API. A 10 MB Word document, for instance, could amount to thousands of tokens, and each time you upload that document through the API, the cost multiplies.
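If you want to run the same calculation in code, here is a minimal sketch using OpenAI’s tiktoken library (cl100k_base is the encoding used by GPT-4-class models; the sample text is just an example):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-class models
encoding = tiktoken.get_encoding("cl100k_base")

text = "If you work in a data-related field, you need to keep your skills up to date."
tokens = encoding.encode(text)

print(f"{len(tokens)} tokens")  # API billing is based on counts like this
```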

By using RAG, you send only the relevant part of the document, reducing the number of tokens, so you pay less. It is that straightforward.

 

 

How Does This Retriever Do This?

Before retrieval begins, documents are split into small chunks, such as paragraphs. Each chunk is converted into a dense vector using an embedding model (OpenAI Embeddings, Sentence-BERT, etc.).

When a user asks a question, like what the call date is, the query is embedded the same way, and the retriever compares the query vector to all chunk vectors and selects the most similar ones. It is brilliant, right?
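Here is a minimal sketch of that pipeline, assuming the sentence-transformers package and a toy list of chunks; a real system would store the vectors in a vector database rather than comparing against every chunk in memory:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy chunks; in practice these come from splitting your documents
chunks = [
    "user_id 10234 had a call_date of 2024-03-18.",
    "Our refund policy allows returns within 30 days.",
    "The quarterly report highlights revenue growth in Europe.",
]

# Embed every chunk once with a small open-source model
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Embed the query and return the most similar chunks."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    # With normalized vectors, the dot product is cosine similarity
    scores = chunk_vectors @ query_vector
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

print(retrieve("What is the call date for user_id = 10234?"))
```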

 

What Is a Generator?

As we explained above, after the retriever finds the most relevant documents, the generator takes over. It generates an answer using the user’s query and the retrieved documents.

This method also minimizes the risk of hallucination: instead of generating an answer freely from the data it was trained on, the model grounds its response in an actual document you provided.
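A minimal sketch of that hand-off, assuming the openai Python package and the retrieve() helper from the retriever sketch above; the model name is only an example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_answer(query: str) -> str:
    """Answer the query grounded only in retrieved context."""
    context = "\n".join(retrieve(query, top_k=3))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided context. "
                        "If the answer is not in the context, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

print(generate_answer("What is the call date for user_id = 10234?"))
```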

 

The Context Window Evolution

 
Early models like GPT-2 had small context windows; GPT-2 was limited to about 1,024 tokens. That’s why those models didn’t offer file-upload features. If you remember, ChatGPT only added a file-upload feature a few model generations later, once context windows had grown enough to support it.

Advanced models like GPT-4o have a 128K-token limit, which supports file uploads and might make RAG seem redundant as far as context length goes. But that’s where cost reduction enters the picture.

So today, one of the reasons to use RAG is to reduce cost, but not the only one. LLM usage costs keep decreasing, and GPT-4.1 introduced a context window of up to 1 million tokens, a fantastic increase. RAG has evolved alongside these changes.

 

Industry-Related Practice

 
Now, LLMs are evolving into agents, meant to automate your tasks instead of just generating answers. Some companies are developing models that even control your keyboard and mouse.

In these cases, you cannot take a chance on hallucination, and that is where RAG comes into the scene. In this section, we will analyze one real-world example in depth.

Companies are looking for talent to develop agents for them. It is not just large companies; mid-size companies, small businesses, and startups are exploring their options too. You can find these jobs on freelancer websites like Upwork and Fiverr.

 

Marketing Agent

Let’s say a mid-size European company wants you to create an agent that generates marketing proposals for its clients using company documents.

On top of that, the agent should enrich each proposal with relevant hotel information for business events or campaigns.

But there is an issue: the agent frequently hallucinates. Why does this happen? Because instead of relying only on the company’s documents, the model pulls information from its original training data. That training data may be outdated because, as you know, these LLMs are not retrained regularly.

As a result, the AI ends up adding incorrect hotel names or simply irrelevant information. Now you can pinpoint the root cause of the problem: a lack of reliable information.

This is where RAG comes in. Using a web-browsing API, companies have LLMs retrieve reliable information from the web and reference it while generating answers. Consider this prompt:

“Generate a proposal, based on the tone of voice and company information, and use web search to find the hotel names.”

This web-search step is retrieval in action: the model augments its generation with freshly retrieved information, which makes it a form of RAG.
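A minimal sketch of that pattern, assuming a hypothetical search_hotels() helper that wraps whatever web-search API you choose; the prompt wiring is the point, not the specific service:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def search_hotels(city: str) -> str:
    """Hypothetical helper: in a real agent, this would call a
    web-search API and summarize current hotel listings."""
    return f"Placeholder: live hotel listings for {city} go here."

def draft_proposal(company_docs: str, city: str) -> str:
    """Ground the proposal in company docs plus retrieved web results."""
    hotel_info = search_hotels(city)  # retrieved, not recalled from training data
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system",
             "content": "Write a marketing proposal using ONLY the "
                        "company information and hotel data provided."},
            {"role": "user",
             "content": f"Company documents:\n{company_docs}\n\n"
                        f"Hotel information:\n{hotel_info}\n\n"
                        "Draft a proposal for a business event."},
        ],
    )
    return response.choices[0].message.content
```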

 

Final Thoughts

 
In this article, we traced the evolution of AI models and why RAG is used with them. As you can see, the reasons have changed over time, but the underlying problem remains the same: efficiency.

Whether the reason is cost or speed, this method will continue to be used in AI-related tasks. And by “AI-related,” I don’t exclude data science, because, as you’re probably aware, with the upcoming AI summer, data science has already been deeply affected by AI too.

If you want to follow similar articles, solve 700+ interview questions related to Data Science, and 50+ Data projects, visit my platform.
 
 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.
