Most Candidates Fail These SQL Concepts in Data Interviews

Image by author | Canva

An interviewer’s job is to find the most suitable candidates for the advertised position. In doing so, they will gladly set up SQL interview questions to see if they can catch you off guard. There are several SQL concepts at which candidates often fail.

Hopefully, you’ll be one of those who avoid that destiny, as I’ll explain these concepts in detail below, complete with examples of how to solve certain problems correctly.

Most Candidates Fail These SQL Concepts in Data Interviews

# 1. Window Functions

Why It’s Hard: Candidates memorize what each window function does but don’t really understand how window frames, partitions, or ordering actually work.

Common Mistakes: A common mistake is not specifying ORDER BY in ranking window functions or value window functions, such as LEAD() or LAG(), and expecting the query to work or for the result to be deterministic.

Example: In this example, you need to find users who made a second purchase within 7 days of any previous purchase.

You might write this query.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;

At first glance, everything might seem right. The code even outputs something that might appear to be a correct answer.

Window Functions

First of all, we are lucky that the code works at all! This happens simply because I’m writing it in PostgreSQL. In some other SQL flavors, you’d get an error since ORDER BY is mandatory in ranking and analytical window functions.

Second, the output is wrong; I highlighted some rows that shouldn’t be there. Why do they appear, then?

They appear because we didn’t specify an ORDER BY clause in the LAG() window function. Without it, the row order is arbitrary. So, we are comparing the current transaction to some random earlier row for that user, not the one that occurred immediately before it in time.

This is not what the question asks. We need to compare each transaction to the previous one by date. In other words, we need to specify this explicitly in the ORDER BY clause within the LAG() function.

WITH ordered_tx AS (
  SELECT user_id,
         created_at::date AS tx_date,
         LAG(created_at::DATE) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_tx_date
  FROM amazon_transactions
)

SELECT DISTINCT user_id
FROM ordered_tx
WHERE prev_tx_date IS NOT NULL AND tx_date - prev_tx_date <= 7;

# 2. Filtering With Aggregates (Especially HAVING vs. WHERE)

Why It’s Hard: People often don’t understand the execution order in SQL, which is: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY. This order means that WHERE filters rows before aggregation, and HAVING filters after. That also, logically, means that you can’t use aggregate functions in the WHERE clause.

Common Mistake: Trying to use aggregate functions in WHERE in a grouped query and getting an error.

Example: This interview question asks you to find the total revenue made by each winery. Only wineries where 90 is the lowest number of points for any of their varieties should be considered.

Many will see this as an easy question and hastily write this query.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
WHERE MIN(points) >= 90
GROUP BY winery, variety
ORDER BY winery, total_revenue DESC;

However, that code will throw an error stating that aggregate functions are not allowed in the WHERE clause. This pretty much explains everything. The solution? Move the filtering condition from WHERE to HAVING.

SELECT winery,
       variety,
       SUM(price) AS total_revenue
FROM winemag_p1
GROUP BY winery, variety
HAVING MIN(points) >= 90
ORDER BY winery, total_revenue DESC;

# 3. Self-Joins for Time-Based or Event-Based Comparisons

Why It’s Hard: The idea of joining a table with itself is quite unintuitive, so candidates often forget it’s an option.

Common Mistake: Using subqueries and complicating the query when joining a table with itself would be simpler and faster, especially when filtering by dates or events.

Example: Here’s a question asking you to show the change of every currency’s exchange rate between 1 January 2020 and 1 July 2020.

You can solve this by writing an outer correlated subquery that fetches the July 1 exchange rates, then subtracts the January 1 exchange rates, which come from the inner subquery.

SELECT jan_rates.source_currency,
  (SELECT exchange_rate 
   FROM sf_exchange_rate 
   WHERE source_currency = jan_rates.source_currency AND date="2020-07-01") - jan_rates.exchange_rate AS difference
FROM (SELECT source_currency, exchange_rate
      FROM sf_exchange_rate
      WHERE date="2020-01-01"
) AS jan_rates;

This returns a correct output, but such a solution is unnecessarily complicated. A much simpler solution, with fewer lines of code, involves self-joining a table with itself and then applying two date filtering conditions in the WHERE clause.

SELECT jan.source_currency,
       jul.exchange_rate - jan.exchange_rate AS difference
FROM sf_exchange_rate jan
JOIN sf_exchange_rate jul ON jan.source_currency = jul.source_currency
WHERE jan.date="2020-01-01" AND jul.date="2020-07-01";

# 4. Subqueries vs. Common Table Expressions (CTEs)

Why It’s Hard: People often get stuck on subqueries because they learn them before Common Table Expressions (CTEs) and continue using them for any query with layered logic. However, subqueries can get messy very quickly.

Common Mistake: Using deeply nested SELECT statements when CTEs would be much simpler.

Example: In the interview question from Google and Netflix, you need to find the top actors based on their average movie rating within the genre in which they appear most frequently.

The solution using CTEs is as follows.

WITH genre_stats AS
  (SELECT actor_name,
          genre,
          COUNT(*) AS movie_count,
          AVG(movie_rating) AS avg_rating
   FROM top_actors_rating
   GROUP BY actor_name,
            genre),
            
max_genre_count AS
  (SELECT actor_name,
          MAX(movie_count) AS max_count
   FROM genre_stats
   GROUP BY actor_name),
     
top_genres AS
  (SELECT gs.*
   FROM genre_stats gs
   JOIN max_genre_count mgc ON gs.actor_name = mgc.actor_name
   AND gs.movie_count = mgc.max_count),
     
top_genre_avg AS
  (SELECT actor_name,
          MAX(avg_rating) AS max_avg_rating
   FROM top_genres
   GROUP BY actor_name),
   
filtered_top_genres AS
  (SELECT tg.*
   FROM top_genres tg
   JOIN top_genre_avg tga ON tg.actor_name = tga.actor_name
   AND tg.avg_rating = tga.max_avg_rating),
     ranked_actors AS
  (SELECT *,
          DENSE_RANK() OVER (
                             ORDER BY avg_rating DESC) AS rank
   FROM filtered_top_genres),
   
final_selection AS
  (SELECT MAX(rank) AS max_rank
   FROM ranked_actors
   WHERE rank <= 3)
   
SELECT actor_name,
       genre,
       avg_rating
FROM ranked_actors
WHERE rank <=
    (SELECT max_rank
     FROM final_selection);

It is relatively complicated, but it still consists of six clear CTEs, with the code’s readability enhanced by clear aliases.

Curious what the same solution would look like using only subqueries? Here it is.

SELECT ra.actor_name,
       ra.genre,
       ra.avg_rating
FROM (
    SELECT *,
           DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
    FROM (
        SELECT tg.*
        FROM (
            SELECT gs.*
            FROM (
                SELECT actor_name,
                       genre,
                       COUNT(*) AS movie_count,
                       AVG(movie_rating) AS avg_rating
                FROM top_actors_rating
                GROUP BY actor_name, genre
            ) AS gs
            JOIN (
                SELECT actor_name,
                       MAX(movie_count) AS max_count
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS genre_stats
                GROUP BY actor_name
            ) AS mgc
            ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
        ) AS tg
        JOIN (
            SELECT actor_name,
                   MAX(avg_rating) AS max_avg_rating
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS top_genres
            GROUP BY actor_name
        ) AS tga
        ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
    ) AS filtered_top_genres
) AS ra
WHERE ra.rank <= (
    SELECT MAX(rank)
    FROM (
        SELECT *,
               DENSE_RANK() OVER (ORDER BY avg_rating DESC) AS rank
        FROM (
            SELECT tg.*
            FROM (
                SELECT gs.*
                FROM (
                    SELECT actor_name,
                           genre,
                           COUNT(*) AS movie_count,
                           AVG(movie_rating) AS avg_rating
                    FROM top_actors_rating
                    GROUP BY actor_name, genre
                ) AS gs
                JOIN (
                    SELECT actor_name,
                           MAX(movie_count) AS max_count
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS genre_stats
                    GROUP BY actor_name
                ) AS mgc
                ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
            ) AS tg
            JOIN (
                SELECT actor_name,
                       MAX(avg_rating) AS max_avg_rating
                FROM (
                    SELECT gs.*
                    FROM (
                        SELECT actor_name,
                               genre,
                               COUNT(*) AS movie_count,
                               AVG(movie_rating) AS avg_rating
                        FROM top_actors_rating
                        GROUP BY actor_name, genre
                    ) AS gs
                    JOIN (
                        SELECT actor_name,
                               MAX(movie_count) AS max_count
                        FROM (
                            SELECT actor_name,
                                   genre,
                                   COUNT(*) AS movie_count,
                                   AVG(movie_rating) AS avg_rating
                            FROM top_actors_rating
                            GROUP BY actor_name, genre
                        ) AS genre_stats
                        GROUP BY actor_name
                    ) AS mgc
                    ON gs.actor_name = mgc.actor_name AND gs.movie_count = mgc.max_count
                ) AS top_genres
                GROUP BY actor_name
            ) AS tga
            ON tg.actor_name = tga.actor_name AND tg.avg_rating = tga.max_avg_rating
        ) AS filtered_top_genres
    ) AS ranked_actors
    WHERE rank <= 3
);

There is redundant logic repeated across subqueries. How many subqueries is that? I have no idea. The code is impossible to maintain. Even though I just wrote it, I’d still need half a day to understand it if I wanted to change something tomorrow. Additionally, the completely meaningless subquery aliases do not help.

# 5. Handling NULLs in Logic

Why It’s Hard: Candidates often think that NULL is equal to something. It’s not. NULL isn’t equal to anything — not even itself. Logic involving NULLs behaves differently from logic involving actual values.

Common Mistake: Using = NULL instead of IS NULL in filtering or missing output rows because NULLs break the condition logic.

Example: There’s an interview question by IBM that asks you to calculate the total number of interactions and the total number of contents created for each customer.

It doesn’t sound too tricky, so you might write this solution with two CTEs, where one CTE counts the number of interactions per customer, while the other counts the number of content items created by a customer. In the final SELECT, you FULL OUTER JOIN the two CTEs, and you have the solution. Right?

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT i.customer_id,
  i.total_interactions,
  c.total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;

Almost right. Here’s the output. (By the way, you see double quotation marks (“”) instead of NULL. That’s how the StrataScratch UI displays it, but trust me, the engine still treats them for what they are: NULL values).

5. Handling NULLs in Logic

The highlighted rows contain NULLs. This makes the output incorrect. A NULL value is neither the customer ID nor the number of interactions and contents, which the question explicitly asks you to show.

What we’re missing in the above solution is COALESCE() to handle NULLs in the final SELECT. Now, all the customers without interactions will get their IDs from the content_summary CTE. Also, for customers that don’t have interactions, or content, or both, we’ll now replace NULL with 0, which is a valid number.

WITH interactions_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_interactions
   FROM customer_interactions
   GROUP BY customer_id),
   
content_summary AS
  (SELECT customer_id,
          COUNT(*) AS total_content_items
   FROM user_content
   GROUP BY customer_id)
   
SELECT COALESCE(i.customer_id, c.customer_id) AS customer_id,
       COALESCE(i.total_interactions, 0) AS total_interactions,
       COALESCE(c.total_content_items, 0) AS total_content_items
FROM interactions_summary AS i
FULL OUTER JOIN content_summary AS c ON i.customer_id = c.customer_id
ORDER BY customer_id;

# 6. Group-Based Deduplication

Why It’s Hard: Group-based deduplication means you’re selecting one row per group, e.g., “most recent”, “highest score”, etc. At first, it sounds like you only need to pick one row per user. But you can’t use GROUP BY unless you aggregate. On the other hand, you often need a full row, not a single value that aggregation and GROUP BY return.

Common Mistake: Using GROUP BY + LIMIT 1 (or DISTINCT ON, which is PostgreSQL-specific) instead of ROW_NUMBER() or RANK(), the latter if you want ties included.

Example: This question asks you to identify the best-selling item for each month, and there’s no need to separate months by year. The best-selling item is calculated as unitprice * quantity.

The naive approach would be this. First, extract the sale month from invoicedate, select description, and find the total sales by summing unitprice * quantity. Then, to get the total sales by month and product description, we simply GROUP BY those two columns. Finally, we only need to use ORDER BY to sort the output from the best to the worst-selling product and use LIMIT 1 to output only the first row, i.e., the best-selling item.

SELECT DATE_PART('MONTH', invoicedate) AS sale_month,
       description,
       SUM(unitprice * quantity) AS total_paid
FROM online_retail
GROUP BY sale_month, description
ORDER BY total_paid DESC
LIMIT 1;

As I said, this is naive; the output somewhat resembles what we need, but we need this for every month, not just one.

Group-Based Deduplication

One of the correct approaches is to use the RANK() window function. With this approach, we follow a similar method to the previous code. The difference is that the query now becomes a subquery in the FROM clause. In addition, we use RANK() to partition the data by month and then rank the rows within each partition (i.e., for each month separately) from the best-selling to the worst-selling item.

Then, in the main query, we simply select the required columns and output only rows where the rank is 1 using the WHERE clause.

SELECT month,
       description,
       total_paid
FROM
  (SELECT DATE_PART('month', invoicedate) AS month,
          description,
          SUM(unitprice * quantity) AS total_paid,
          RANK() OVER (PARTITION BY DATE_PART('month', invoicedate) ORDER BY SUM(unitprice * quantity) DESC) AS rnk
   FROM online_retail
   GROUP BY month, description) AS tmp
WHERE rnk = 1;

# Conclusion

The six concepts we’ve covered commonly appear in SQL coding interview questions. Pay attention to them, practice interview questions that involve these concepts, learn the correct approaches, and you’ll significantly improve your chances in your interviews.

Nate Rosidi is a data scientist and in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.

Most Candidates Fail These SQL Concepts in Data Interviews

# 1. Window Functions

# 2. Filtering With Aggregates (Especially HAVING vs. WHERE)

# 3. Self-Joins for Time-Based or Event-Based Comparisons

# 4. Subqueries vs. Common Table Expressions (CTEs)

# 5. Handling NULLs in Logic

# 6. Group-Based Deduplication

# Conclusion

Related Posts

How to Context Engineer to Optimize Question Answering Pipelines

Showcasing Your Work on HuggingFace Spaces

Leave a Reply Cancel reply