
Image by Author | Canva
With large language models (LLMs), everyone is a coder today! That, at least, is the message you get from LLM promo materials, and, like any ad, it isn’t quite true. Coding is much more than producing code at breakneck speed. However, translating English (or other natural languages) into executable SQL queries is one of the most compelling uses of LLMs, and it has its place in the world.
# Why Use LLMs to Generate SQL?
There are several benefits to using LLMs to generate SQL and, as with everything, also some drawbacks.
# Two Types of Text-to-SQL LLMs
We can distinguish between two very broad types of text-to-SQL technology currently available, based on their access to your database schema:
- LLMs without direct access
- LLMs with direct access
// 1. LLMs Without Direct Access to Database Schema
These LLMs don’t connect to or execute queries against the actual database. The closest you can get is to upload the datasets you want to query. These tools rely on you providing context about your schema.
Tool Examples:
Use Cases:
- Query drafting and prototyping
- Learning and teaching
- Static code generation for later review
// 2. LLMs With Direct Access to Database Schema
These LLMs connect directly to your live data sources, such as PostgreSQL, Snowflake, BigQuery, or Redshift. They allow you to generate, execute, and return results from SQL queries live on your database.
Tool Examples:
Use Cases:
- Conversational analytics for business users
- Real-time data exploration
- Embedded AI assistants in BI platforms
# Step-by-Step: How to Go from Text to SQL
The basic workflow of getting SQL from text is similar, whether you use disconnected or connected LLMs.
We’ll try to solve an interview question from Shopify and Amazon in ChatGPT, following the steps below.
// 1. Define the Schema
For the query to work on your data, the LLM needs to understand your data structure clearly. This typically encompasses:
- Table names
- Column names and types
- Relationships between tables (joins, keys)
This information can be passed directly in the prompt or can be retrieved dynamically using vector search within the retrieval-augmented generation (RAG) pipeline.
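For instance, the schema used later in this article could be passed to the model as plain DDL. The sketch below is just one way to express it; the primary-key and foreign-key constraints are our assumptions based on how the tables are joined later, not something stated in the original question.
-- Illustrative DDL for the example schema (keys are assumed, not given)
CREATE TABLE customers (
    id           bigint PRIMARY KEY,
    first_name   text,
    last_name    text,
    address      text,
    city         text,
    phone_number text
);

CREATE TABLE orders (
    id               bigint PRIMARY KEY,
    cust_id          bigint REFERENCES customers (id),
    order_date       date,
    order_details    text,
    total_order_cost bigint
);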
// 2. Prompt With Natural Language
The prompt will typically consist of two segments:
- Schema definition
- Question(s) for which we need an SQL answer
Example: Let me first provide you with a prompt structure that includes placeholders. We’ll then write an actual prompt.
We will use role-play prompting, which means instructing ChatGPT to assume a specific role.
Here’s how to structure the prompt.
Dataset: My dataset consists of [number of tables] tables.
The first one is [first table name] with the following columns and data types:
[column names and data types]
The second table is [second table name] with the following columns and data types:
[column names and data types]
Question: [provide a question to be answered]
Assumptions: [provide assumptions for solving the question]
Role: [describe a role LLM has to play]
Let’s now write an actual prompt that solves our interview questions.
Here it is, and we hope it will provide a correct SQL solution. The column names, data types, question, and assumptions are just pasted from the question itself.
Dataset: My dataset consists of two tables.
The first one is "customers" with the following columns and data types:
address: text
city: text
first_name: text
id: bigint
last_name: text
phone_number: text
The second table is "orders" with the following columns and data types:
cust_id: bigint
id: bigint
order_date: date
order_details: text
total_order_cost: bigint
Question: Find the customers with the highest daily total order cost between 2019-02-01 and 2019-05-01. If a customer had more than one order on a certain day, sum the order costs on a daily basis. Output each customer's first name, total cost of their items, and the date.
Assumption: For simplicity, you can assume that every first name in the dataset is unique.
Role: Act as a SQL expert and write a PostgreSQL query that answers the question.
// 3. Generate the SQL Query
In this stage, the LLM of your choice generates the SQL query.
Example: Here’s the PostgreSQL code ChatGPT produced for us.
SELECT
c.first_name,
daily_totals.order_date,
daily_totals.total_cost
FROM (
SELECT
cust_id,
order_date,
SUM(total_order_cost) AS total_cost
FROM orders
WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY cust_id, order_date
) AS daily_totals
JOIN customers c ON c.id = daily_totals.cust_id
WHERE (daily_totals.cust_id, daily_totals.order_date) IN (
SELECT
cust_id,
order_date
FROM (
SELECT
cust_id,
order_date,
SUM(total_order_cost) AS total_cost,
RANK() OVER (ORDER BY SUM(total_order_cost) DESC) AS rnk
FROM orders
WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY cust_id, order_date
) ranked
WHERE rnk = 1
);
// 4. Execute the Query
Now you can execute the query, either directly (if your LLM supports live database connections) or by copying it into your database system and running it there.
Example: In our case, we will copy the code into the StrataScratch code editor and use the ‘Check Solution’ button to validate the solution.
This is what it outputs, and it’s a correct solution. Bravo for ChatGPT! It nailed it on the first try!
first_name | order_date | total_cost |
---|---|---|
Jill | 2019-04-19 | 275 |
Mark | 2019-04-19 | 275 |
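If you want to double-check the result yourself rather than rely only on the checker, a quick aggregation like the sketch below lists the largest daily totals in the date range directly, so you can confirm that 275 on 2019-04-19 really is the maximum.
-- Sanity check: inspect the top daily totals per customer in the date range
SELECT
    cust_id,
    order_date,
    SUM(total_order_cost) AS total_cost
FROM orders
WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
GROUP BY cust_id, order_date
ORDER BY total_cost DESC
LIMIT 5;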
// 5. Review, Visualize, and Refine
Depending on why you’re using LLMs to write SQL, this step may be optional. In a business setting, you’d typically present the query output in a user-friendly format, which usually involves:
- Showing results as a table and/or chart
- Allowing follow-up requests (e.g., “Can you include the customer city?”) and returning the updated query and output (see the sketch after this list)
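As a sketch of such a follow-up, here is one way the refined query could look. It restructures ChatGPT’s answer into a single CTE (which is arguably easier to review) and adds the requested city column; the logic is otherwise the same, and c.city comes straight from the customers table in our schema.
-- Follow-up sketch: same daily-total logic, restructured as a CTE, with city added
WITH daily_totals AS (
    SELECT
        cust_id,
        order_date,
        SUM(total_order_cost) AS total_cost
    FROM orders
    WHERE order_date BETWEEN '2019-02-01' AND '2019-05-01'
    GROUP BY cust_id, order_date
)
SELECT
    c.first_name,
    c.city,
    dt.order_date,
    dt.total_cost
FROM daily_totals dt
JOIN customers c ON c.id = dt.cust_id
WHERE dt.total_cost = (SELECT MAX(total_cost) FROM daily_totals);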
# Pitfalls and Best Practices
In our example, ChatGPT came up with the correct answer immediately. However, that doesn’t mean it always will, especially when the data and requirements get more complicated. Using LLMs to get SQL queries from text is not without pitfalls. You can avoid most of them by applying some best practices if you want to make LLM query generation part of your data science workflow.
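One small, concrete example of the kind of detail worth reviewing: BETWEEN is inclusive on both ends, so the generated filter keeps orders placed on 2019-05-01. That was fine here (the solution passed the check), but if a requirement ever means strictly before a date, a half-open range is the safer pattern, as in the hedged sketch below.
-- Sketch: half-open date range, in case the upper bound should be excluded
SELECT
    cust_id,
    order_date,
    SUM(total_order_cost) AS total_cost
FROM orders
WHERE order_date >= '2019-02-01'
  AND order_date <  '2019-05-01'
GROUP BY cust_id, order_date;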
# Conclusion
LLMs can be your best friend when you want to create SQL queries from text. However, to make the most of these tools, you must clearly understand what you want to achieve and the use cases where LLMs are genuinely beneficial.
This article provides such guidelines, along with an example of how to prompt an LLM in natural language and get working SQL code.
Nate Rosidi is a data scientist who works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.