
Image by author | Canva
An interviewer’s job is to find the most suitable candidates for the advertised position. In doing so, they will gladly set up SQL interview questions to see if they can catch you off guard. There are several SQL concepts at which candidates often fail.
Hopefully, you’ll be one of those who avoid that destiny, as I’ll explain these concepts in detail below, complete with examples of how to solve certain problems correctly.
# 1. Window Functions
Why It’s Hard: Candidates memorize what each window function does but don’t really understand how window frames, partitions, or ordering actually work.
Common Mistakes: A common mistake is not specifying ORDER BY
in ranking window functions or value window functions, such as LEAD()
or LAG()
, and expecting the query to work or for the result to be deterministic.
Example: In this example, you need to find users who made a second purchase within 7 days of any previous purchase.
You might write this query.
WITH ordered_tx AS (
SELECT user_id,
created_at::date AS tx_date,
LAG(created_at::DATE) OVER (PARTITION BY user_id) AS prev_tx_date
FROM amazon_transactions
)
SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;
At first glance, everything might seem right. The code even outputs something that might appear to be a correct answer.
First of all, we are lucky that the code works at all! This happens simply because I’m writing it in PostgreSQL. In some other SQL flavors, you’d get an error since ORDER BY
is mandatory in ranking and analytical window functions.
Second, the output is wrong; I highlighted some rows that shouldn’t be there. Why do they appear, then?
They appear because we didn’t specify an ORDER BY
clause in the LAG()
window function. Without it, the row order is arbitrary. So, we are comparing the current transaction to some random earlier row for that user, not the one that occurred immediately before it in time.
This is not what the question asks. We need to compare each transaction to the previous one by date. In other words, we need to specify this explicitly in the ORDER BY
clause within the LAG()
function.
WITH ordered_tx AS (
SELECT user_id,
created_at::date AS tx_date,
LAG(created_at::DATE) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_tx_date
FROM amazon_transactions
)
SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;
# 2. Filtering With Aggregates (Especially HAVING vs. WHERE)
Why It’s Hard: People often don’t understand the execution order in SQL, which is: FROM
-> WHERE
-> GROUP BY
-> HAVING
-> SELECT
-> ORDER BY
. This order means that WHERE
filters rows before aggregation, and HAVING
filters after. That also, logically, means that you can’t use aggregate functions in the WHERE
clause.
Common Mistake: Trying to use aggregate functions in WHERE
in a grouped query and getting an error.
Example: This interview question asks you to find the total revenue made by each winery. Only wineries where 90 is the lowest number of points for any of their varieties should be considered.
Many will see this as an easy question and hastily write this query.
SELECT winery,
variety,
SUM(price) AS total_revenue
FROM winemag_p1
WHERE MIN(points) >= 90
GROUP BY winery, variety
ORDER BY winery, total_revenue DESC;
However, that code will throw an error stating that aggregate functions are not allowed in the WHERE
clause. This pretty much explains everything. The solution? Move the filtering condition from WHERE
to HAVING
.
SELECT winery,
variety,
SUM(price) AS total_revenue
FROM winemag_p1
GROUP BY winery, variety
HAVING MIN(points) >= 90
ORDER BY winery, total_revenue DESC;
# 3. Self-Joins for Time-Based or Event-Based Comparisons
Why It’s Hard: The idea of joining a table with itself is quite unintuitive, so candidates often forget it’s an option.
Common Mistake: Using subqueries and complicating the query when joining a table with itself would be simpler and faster, especially when filtering by dates or events.
Example: Here’s a question asking you to show the change of every currency’s exchange rate between 1 January 2020 and 1 July 2020.
You can solve this by writing an outer correlated subquery that fetches the July 1 exchange rates, then subtracts the January 1 exchange rates, which come from the inner subquery.
SELECT jan_rates.source_currency,
(SELECT exchange_rate
FROM sf_exchange_rate
WHERE source_currency = jan_rates.source_currency AND date="2020-07-01") - jan_rates.exchange_rate AS difference
FROM (SELECT source_currency, exchange_rate
FROM sf_exchange_rate
WHERE date="2020-01-01"
) AS jan_rates;
This returns a correct output, but such a solution is unnecessarily complicated. A much simpler solution, with fewer lines of code, involves self-joining a table with itself and then applying two date filtering conditions in the WHERE
clause.
SELECT jan.source_currency,
jul.exchange_rate - jan.exchange_rate AS difference
FROM sf_exchange_rate jan
JOIN sf_exchange_rate jul ON jan.source_currency = jul.source_currency
WHERE jan.date="2020-01-01" AND jul.date="2020-07-01";
# 4. Subqueries vs. Common Table Expressions (CTEs)
Why It’s Hard: People often get stuck on subqueries because they learn them before Common Table Expressions (CTEs) and continue using them for any query with layered logic. However, subqueries can get messy very quickly.
Common Mistake: Using deeply nested SELECT
statements when CTEs would be much simpler.
Example: In the interview question from Google and Netflix, you need to find the top actors based on their average movie rating within the genre in which they appear most frequently.
The solution using CTEs is as follows.
WITH genre_stats AS
(SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name,
genre),
max_genre_count AS
(SELECT actor_name,
MAX(movie_count) AS max_count
FROM genre_stats
GROUP BY actor_name),
top_genres AS
(SELECT gs.*
FROM genre_stats gs
JOIN max_genre_count mgc ON gs.actor_name = mgc.actor_name
AND gs.movie_count = mgc.max_count),
top_genre_avg AS
(SELECT actor_name,
MAX(avg_rating) AS max_avg_rating
FROM top_genres
GROUP BY actor_name),
filtered_top_genres AS
(SELECT tg.*
FROM top_genres tg
JOIN top_genre_avg tga ON tg.actor_name = tga.actor_name
AND tg.avg_rating = tga.max_avg_rating),
ranked_actors AS
(SELECT *,
DENSE_RANK() OVER (
ORDER BY avg_rating DESC) AS rank
FROM filtered_top_genres),
final_selection AS
(SELECT MAX(rank) AS max_rank
FROM ranked_actors
WHERE rank <= 3)
SELECT actor_name,
genre,
avg_rating
FROM ranked_actors
WHERE rank <=
(SELECT max_rank
FROM final_selection);
It is relatively complicated, but it still consists of six clear CTEs, with the code’s readability enhanced by clear aliases.
Curious what the same solution would look like using only subqueries? Here it is.
SELECT ra.actor_name,
ra.genre,
ra.avg_rating
FROM (
SELECT *,
DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
FROM (
SELECT tg.*
FROM (
SELECT gs.*
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS gs
JOIN (
SELECT actor_name,
MAX(movie_count) AS max_count
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS genre_stats
GROUP BY actor_name
) AS mgc
ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
) AS tg
JOIN (
SELECT actor_name,
MAX(avg_rating) AS max_avg_rating
FROM (
SELECT gs.*
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS gs
JOIN (
SELECT actor_name,
MAX(movie_count) AS max_count
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS genre_stats
GROUP BY actor_name
) AS mgc
ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
) AS top_genres
GROUP BY actor_name
) AS tga
ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
) AS filtered_top_genres
) AS ra
WHERE ra.rank <= (
SELECT MAX(rank)
FROM (
SELECT *,
DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
FROM (
SELECT tg.*
FROM (
SELECT gs.*
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS gs
JOIN (
SELECT actor_name,
MAX(movie_count) AS max_count
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS genre_stats
GROUP BY actor_name
) AS mgc
ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
) AS tg
JOIN (
SELECT actor_name,
MAX(avg_rating) AS max_avg_rating
FROM (
SELECT gs.*
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS gs
JOIN (
SELECT actor_name,
MAX(movie_count) AS max_count
FROM (
SELECT actor_name,
genre,
COUNT(*) AS movie_count,
AVG(movie_rating) AS avg_rating
FROM top_actors_rating
GROUP BY actor_name, genre
) AS genre_stats
GROUP BY actor_name
) AS mgc
ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
) AS top_genres
GROUP BY actor_name
) AS tga
ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
) AS filtered_top_genres
) AS ranked_actors
WHERE rank <= 3
);
There is redundant logic repeated across subqueries. How many subqueries is that? I have no idea. The code is impossible to maintain. Even though I just wrote it, I’d still need half a day to understand it if I wanted to change something tomorrow. Additionally, the completely meaningless subquery aliases do not help.
# 5. Handling NULLs in Logic
Why It’s Hard: Candidates often think that NULL
is equal to something. It’s not. NULL
isn’t equal to anything — not even itself. Logic involving NULL
s behaves differently from logic involving actual values.
Common Mistake: Using = NULL
instead of IS NULL
in filtering or missing output rows because NULL
s break the condition logic.
Example: There’s an interview question by IBM that asks you to calculate the total number of interactions and the total number of contents created for each customer.
It doesn’t sound too tricky, so you might write this solution with two CTEs, where one CTE counts the number of interactions per customer, while the other counts the number of content items created by a customer. In the final SELECT
, you FULL OUTER JOIN
the two CTEs, and you have the solution. Right?
WITH interactions_summary AS
(SELECT customer_id,
COUNT(*) AS total_interactions
FROM customer_interactions
GROUP BY customer_id),
content_summary AS
(SELECT customer_id,
COUNT(*) AS total_content_items
FROM user_content
GROUP BY customer_id)
SELECT i.customer_id,
i.total_interactions,
c.total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;
Almost right. Here’s the output. (By the way, you see double quotation marks (“”) instead of NULL
. That’s how the StrataScratch UI displays it, but trust me, the engine still treats them for what they are: NULL
values).
The highlighted rows contain NULL
s. This makes the output incorrect. A NULL
value is neither the customer ID nor the number of interactions and contents, which the question explicitly asks you to show.
What we’re missing in the above solution is COALESCE()
to handle NULL
s in the final SELECT
. Now, all the customers without interactions will get their IDs from the content_summary
CTE. Also, for customers that don’t have interactions, or content, or both, we’ll now replace NULL
with 0, which is a valid number.
WITH interactions_summary AS
(SELECT customer_id,
COUNT(*) AS total_interactions
FROM customer_interactions
GROUP BY customer_id),
content_summary AS
(SELECT customer_id,
COUNT(*) AS total_content_items
FROM user_content
GROUP BY customer_id)
SELECT COALESCE(i.customer_id, c.customer_id) AS customer_id,
COALESCE(i.total_interactions, 0) AS total_interactions,
COALESCE(c.total_content_items, 0) AS total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;
# 6. Group-Based Deduplication
Why It’s Hard: Group-based deduplication means you’re selecting one row per group, e.g., “most recent”, “highest score”, etc. At first, it sounds like you only need to pick one row per user. But you can’t use GROUP BY
unless you aggregate. On the other hand, you often need a full row, not a single value that aggregation and GROUP BY
return.
Common Mistake: Using GROUP BY
+ LIMIT 1
(or DISTINCT ON, which is PostgreSQL-specific) instead of ROW_NUMBER()
or RANK()
, the latter if you want ties included.
Example: This question asks you to identify the best-selling item for each month, and there’s no need to separate months by year. The best-selling item is calculated as unitprice * quantity
.
The naive approach would be this. First, extract the sale month from invoicedate
, select description
, and find the total sales by summing unitprice * quantity
. Then, to get the total sales by month and product description, we simply GROUP BY
those two columns. Finally, we only need to use ORDER BY
to sort the output from the best to the worst-selling product and use LIMIT 1
to output only the first row, i.e., the best-selling item.
SELECT DATE_PART('MONTH', invoicedate) AS sale_month,
description,
SUM(unitprice * quantity) AS total_paid
FROM online_retail
GROUP BY sale_month, description
ORDER BY total_paid DESC
LIMIT 1;
As I said, this is naive; the output somewhat resembles what we need, but we need this for every month, not just one.
One of the correct approaches is to use the RANK()
window function. With this approach, we follow a similar method to the previous code. The difference is that the query now becomes a subquery in the FROM
clause. In addition, we use RANK()
to partition the data by month and then rank the rows within each partition (i.e., for each month separately) from the best-selling to the worst-selling item.
Then, in the main query, we simply select the required columns and output only rows where the rank is 1 using the WHERE
clause.
SELECT month,
description,
total_paid
FROM
(SELECT DATE_PART('month', invoicedate) AS month,
description,
SUM(unitprice * quantity) AS total_paid,
RANK() OVER (PARTITION BY DATE_PART('month', invoicedate) ORDER BY SUM(unitprice * quantity) DESC) AS rnk
FROM online_retail
GROUP BY month, description) AS tmp
WHERE rnk = 1;
# Conclusion
The six concepts we’ve covered commonly appear in SQL coding interview questions. Pay attention to them, practice interview questions that involve these concepts, learn the correct approaches, and you’ll significantly improve your chances in your interviews.
Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.