
Image by Author | Canva
Ever run a Python script and immediately wished you hadn’t pressed Enter?
Debugging in data science is not just a chore; it's a survival skill, particularly when you are dealing with messy datasets or building prediction models that actual people rely on.
In this article, we will explore the basics of debugging in data science workflows, using a real-life dataset from a DoorDash take-home project, and, most importantly, how to debug like a pro.
DoorDash Delivery Duration Prediction: What Are We Dealing With?
In this data project, DoorDash asked its data science candidates to predict the delivery duration. Let’s first look at the dataset info. Here is the code:
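That first look is just a quick inspection of the columns, dtypes, and non-null counts; a minimal sketch, assuming the CSV has already been read into a DataFrame named historical_data (the file name here is an assumption):

import pandas as pd

# Hypothetical file name; use whatever path the take-home package provides
historical_data = pd.read_csv("historical_data.csv")

# Print column names, non-null counts, and dtypes in one shot
historical_data.info()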
Here is the output:
It seems they did not provide the delivery duration, so we need to calculate it ourselves. It is simple, and no worries if you are a beginner; let's see how it can be calculated.
import pandas as pd
from datetime import datetime
# Assuming historical_data is your DataFrame
historical_data["created_at"] = pd.to_datetime(historical_data['created_at'])
historical_data["actual_delivery_time"] = pd.to_datetime(historical_data['actual_delivery_time'])
historical_data["actual_total_delivery_duration"] = (historical_data["actual_delivery_time"] - historical_data["created_at"]).dt.total_seconds()
historical_data.head()
Here is the output’s head; you can see the actual_total_delivery_duration column.
Good, now we can start! But before that, here is the data definition language for this dataset.
Columns in historical_data.csv
Time features:
- market_id: A city/region in which DoorDash operates, e.g., Los Angeles, given in the data as an id.
- created_at: Timestamp in UTC when the order was submitted by the consumer to DoorDash. (Note: this timestamp is in UTC, but in case you need it, the actual timezone of the region was US/Pacific).
- actual_delivery_time: Timestamp in UTC when the order was delivered to the consumer.
Store features:
- store_id: An ID representing the restaurant the order was submitted for.
- store_primary_category: Cuisine category of the restaurant, e.g., Italian, Asian.
- order_protocol: A store can receive orders from DoorDash through many modes. This field represents an ID denoting the protocol.
Order features:
- total_items: Total number of items in the order.
- subtotal: Total value of the order submitted (in cents).
- num_distinct_items: Number of distinct items included in the order.
- min_item_price: Price of the item with the least cost in the order (in cents).
- max_item_price: Price of the item with the highest cost in the order (in cents).
Market features:
DoorDash being a marketplace, we have information on the state of the marketplace when the order is placed, which can be used to estimate delivery time. The following features are values at the time of created_at (order submission time):
- total_onshift_dashers: Number of available dashers who are within 10 miles of the store at the time of order creation.
- total_busy_dashers: Subset of the above total_onshift_dashers who are currently working on an order.
- total_outstanding_orders: Number of orders within 10 miles of this order that are currently being processed.
Predictions from other models:
We have predictions from other models for various stages of the delivery process that we can use:
- estimated_order_place_duration: Estimated time for the restaurant to receive the order from DoorDash (in seconds).
- estimated_store_to_consumer_driving_duration: Estimated travel time between the store and consumer (in seconds).
Great, so let’s get started!
Common Python Errors in Data Science Projects
In this section, we will walk through common errors in a data science project, starting with reading the dataset and going through to the most important part: modeling.
Reading the Dataset: FileNotFoundError, Dtype Warning, and Fixes
Case 1: File Not Found — Classic
In data science, your first bug often greets you at read_csv. And not with a hello. Let’s debug that exact moment together, line by line. Here is the code:
import pandas as pd

try:
    df = pd.read_csv('Strata Questions/historical_data.csv')
    df.head(3)
except FileNotFoundError as e:
    import os
    print("File not found. Here's where Python is looking:")
    print("Working directory:", os.getcwd())
    print("Available files:", os.listdir())
    raise e
Here is the output.
You don’t just raise an error—you interrogate it. This shows where the code thinks it is and what it sees around it. If your file’s not on the list, now you know. No guessing. Just facts.
Replace the path with the full one, and voilà!
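A minimal sketch of that fix, with a hypothetical absolute path; swap in whatever the working directory and file listing above revealed on your machine:

import pandas as pd

# Hypothetical full path; replace it with the location the debug output revealed
full_path = "/Users/yourname/projects/Strata Questions/historical_data.csv"
df = pd.read_csv(full_path)
print(df.shape)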
Case 2: Dtype Misinterpretation — Python’s Quietly Wrong Guess
You load the dataset, but something’s off. The bug hides inside your types.
# Assuming df is your loaded DataFrame
try:
    print("Column Types:\n", df.dtypes)
except Exception as e:
    print("Error reading dtypes:", e)
Here is the output.
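If pandas guessed a type you didn't want, for example an ID column read as float64 because of missing values, you can pin the dtype at load time. A minimal sketch, assuming you want market_id and order_protocol as nullable integers (the column choices and path are illustrative):

# Force selected ID-like columns to pandas' nullable integer dtype instead of float
dtype_overrides = {"market_id": "Int64", "order_protocol": "Int64"}
df = pd.read_csv("Strata Questions/historical_data.csv", dtype=dtype_overrides)

print(df[["market_id", "order_protocol"]].dtypes)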
Case 3: Date Parsing — The Silent Saboteur
We discovered that we should calculate the delivery duration first, and we did it with this method.
try:
    # This code was shown earlier to calculate the delivery duration
    df["created_at"] = pd.to_datetime(df['created_at'])
    df["actual_delivery_time"] = pd.to_datetime(df['actual_delivery_time'])
    df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
    print("Successfully calculated delivery duration and checked dtypes.")
    print("Relevant dtypes:\n", df[['created_at', 'actual_delivery_time', 'actual_total_delivery_duration']].dtypes)
except Exception as e:
    print("Error during date processing:", e)
Here is the output.
Good and professional! Now we avoid those red errors, which will lift our mood—I know seeing them can dampen your motivation.
Handling Missing Data: KeyErrors, NaNs, and Logical Pitfalls
Some bugs don’t crash your code. They just give you the wrong results, silently, until you wonder why your model is trash.
This section digs into missing data—not just how to clean it, but how to debug it properly.
Case 1: KeyError — You Thought That Column Existed
Here is our code.
try:
    print(df['store_rating'])
except KeyError as e:
    print("Column not found:", e)
    print("Here are the available columns:\n", df.columns.tolist())
Here is the output.
The code didn’t break because of logic; it broke because of an assumption. That’s precisely where debugging lives. Always list your columns before accessing them blindly.
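A defensive pattern you can adopt here (one option, not the only one) is to check membership before indexing:

# Guard the lookup instead of assuming the column exists
col = "store_rating"
if col in df.columns:
    print(df[col].describe())
else:
    print(f"'{col}' not found; columns containing 'store':",
          [c for c in df.columns if "store" in c])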
Case 2: NaN Count — Missing Values You Didn’t Expect
You assume everything’s clean. But real-world data always hides gaps. Let’s check for them.
try:
    null_counts = df.isnull().sum()
    print("Nulls per column:\n", null_counts[null_counts > 0])
except Exception as e:
    print("Failed to inspect nulls:", e)
Here is the output.
This exposes the silent troublemakers. Maybe store_primary_category is missing in thousands of rows. Maybe timestamps failed conversion and are now NaT.
You wouldn’t have known unless you checked. Debugging means confirming every assumption.
Case 3: Logical Pitfalls — Missing Data That Isn’t Actually Missing
Let’s say you try to filter orders where the subtotal is greater than 1,000,000, expecting hundreds of rows. But this gives you zero:
try:
    filtered = df[df['subtotal'] > 1000000]
    print("Rows with subtotal > 1,000,000:", filtered.shape[0])
except Exception as e:
    print("Filtering error:", e)
That’s not a code error—it’s a logic error. You expected high-value orders, but maybe none exist above that threshold. Debug it with a range check:
print("Subtotal range:", df['subtotal'].min(), "to", df['subtotal'].max())
Here is the output.
Case 4: A Zero isna() Count Doesn’t Mean It’s Clean
Even if isna().sum() shows zero, there might be dirty data, like whitespace or ‘None’ as a string. Run a more aggressive check:
try:
    fake_nulls = df[df['store_primary_category'].isin(['', ' ', 'None', None])]
    print("Rows with fake missing categories:", fake_nulls.shape[0])
except Exception as e:
    print("Fake missing value check failed:", e)
This catches hidden trash that isnull() misses.
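Once you’ve spotted these fake nulls, a common follow-up (an assumption about how you want to treat them, not a required step) is to normalize them into real NaN values so the rest of your missing-data checks can see them:

import numpy as np

# Convert placeholder strings into real missing values so isnull() reports them
df["store_primary_category"] = df["store_primary_category"].replace(["", " ", "None"], np.nan)
print("Nulls after cleaning:", df["store_primary_category"].isnull().sum())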
Feature Engineering Glitches: TypeErrors, Date Parsing, and More
Feature engineering seems fun at first, until your new column breaks every model or throws a TypeError mid-pipeline. Here’s how to debug that phase like someone who’s been burned before.
Case 1: You Think You Can Divide, But You Can’t
Let’s create a new feature. If an error occurs, our try-except block will catch it.
try:
    df['value_per_item'] = df['subtotal'] / df['total_items']
    print("value_per_item created successfully")
except Exception as e:
    print("Error occurred:", e)
Here is the output.
No errors? Good. But let’s look closer.
print(df[['subtotal', 'total_items', 'value_per_item']].sample(3))
Here is the output.
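The subtle failure mode here is division by zero: if total_items is ever 0, pandas won’t raise; it quietly produces inf. A quick check worth running, assuming value_per_item was created as above:

import numpy as np

# Division by zero doesn't raise in pandas; it silently produces inf
zero_items = (df["total_items"] == 0).sum()
inf_ratios = np.isinf(df["value_per_item"]).sum()
print(f"Rows with zero items: {zero_items}, inf ratios: {inf_ratios}")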
Case 2: Date Parsing Gone Wrong
Now, changing your dtype is important, but what if you think everything was done correctly, yet problems persist?
# This is the standard way, but it can fail silently on mixed types
df["created_at"] = pd.to_datetime(df["created_at"])
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"])
You might think it’s okay, but if your column has mixed types, it could fail silently or break your pipeline. That’s why, instead of directly making transformations, it’s better to use a robust function.
from datetime import datetime

def parse_date_debug(df, col):
    try:
        parsed = pd.to_datetime(df[col])
        print(f"[SUCCESS] '{col}' parsed successfully.")
        return parsed
    except Exception as e:
        print(f"[ERROR] Failed to parse '{col}':", e)
        # Find non-date-like values to debug
        non_datetimes = df[pd.to_datetime(df[col], errors="coerce").isna()][col].unique()
        print("Sample values causing issue:", non_datetimes[:5])
        raise

df["created_at"] = parse_date_debug(df, "created_at")
df["actual_delivery_time"] = parse_date_debug(df, "actual_delivery_time")
Here is the output.
This helps you trace faulty rows when datetime parsing crashes.
Case 3: Naive Division That Could Mislead
This won’t throw an error in our DataFrame as the columns are already numeric. But here’s the issue: some datasets sneak in object types, even when they look like numbers. That leads to:
- Misleading ratios
- Wrong model behavior
- No warnings
df["busy_dashers_ratio"] = df["total_busy_dashers"] / df["total_onshift_dashers"]
Let’s validate types before computing, even if the operation won’t throw an error.
import numpy as np

def create_ratio_debug(df, num_col, denom_col, new_col):
    num_type = df[num_col].dtype
    denom_type = df[denom_col].dtype

    if not np.issubdtype(num_type, np.number) or not np.issubdtype(denom_type, np.number):
        print(f"[TYPE WARNING] '{num_col}' or '{denom_col}' is not numeric.")
        print(f"{num_col}: {num_type}, {denom_col}: {denom_type}")
        df[new_col] = np.nan
        return df

    if (df[denom_col] == 0).any():
        print(f"[DIVISION WARNING] '{denom_col}' contains zeros.")

    df[new_col] = df[num_col] / df[denom_col]
    return df

df = create_ratio_debug(df, "total_busy_dashers", "total_onshift_dashers", "busy_dashers_ratio")
Here is the output.
This gives visibility into potential division-by-zero issues and prevents silent bugs.
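If the denominator does contain zeros, the resulting ratio column will hold inf values. One reasonable follow-up (a choice on your part, not part of the function above) is to treat those as missing:

# Treat divide-by-zero results as missing rather than infinite
df["busy_dashers_ratio"] = df["busy_dashers_ratio"].replace([np.inf, -np.inf], np.nan)
print("Missing ratios after cleanup:", df["busy_dashers_ratio"].isna().sum())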
Modeling Mistakes: Shape Mismatch and Evaluation Confusion
Case 1: NaN Values in Features Cause Model to Crash
Let’s say we want to build a linear regression model. LinearRegression() does not support NaN values natively. If any row in X has a missing value, the model refuses to train.
Here is the code, which deliberately creates a shape mismatch to trigger an error:
from sklearn.linear_model import LinearRegression
X_train = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]].iloc[:-10]
y_train = df["actual_total_delivery_duration"].iloc[:-5]
model = LinearRegression()
model.fit(X_train, y_train)
Here is the output.
Let’s debug this issue. First, we check for NaNs.
print(X_train.isna().sum())
Here is the output.
Good, let’s check the other variable too.
print(y_train.isna().sum())
Here is the output.
The mismatch and NaN values must be resolved. Here is the code to fix it.
from sklearn.linear_model import LinearRegression
# Re-align X and y to have the same length
X = df[["estimated_order_place_duration", "estimated_store_to_consumer_driving_duration"]]
y = df["actual_total_delivery_duration"]
# Step 1: Drop rows with NaN in features (X)
valid_X = X.dropna()
# Step 2: Align y to match the remaining indices of X
y_aligned = y.loc[valid_X.index]
# Step 3: Find indices where y is not NaN
valid_idx = y_aligned.dropna().index
# Step 4: Create final clean datasets
X_clean = valid_X.loc[valid_idx]
y_clean = y_aligned.loc[valid_idx]
model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
And voilà! Here is the output.
Case 2: Object Columns (Dates) Crash the Model
Let’s say you try to train a model using a timestamp like actual_delivery_time.
But — oh no — it’s still an object or datetime type, and you accidentally mix it with numeric columns. Linear regression doesn’t like that one bit.
from sklearn.linear_model import LinearRegression
X = df[["actual_delivery_time", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]
model = LinearRegression()
model.fit(X, y)
Here is the error:
You’re combining two incompatible data types in the X matrix:
- One column (actual_delivery_time) is datetime64.
- The other (estimated_order_place_duration) is int64.
Scikit-learn expects all features to be the same numeric dtype. It can’t handle mixed types like datetime and int. Let’s solve it by converting the datetime column to a numeric representation (Unix timestamp).
# Ensure datetime columns are parsed correctly, coercing errors to NaT
df["actual_delivery_time"] = pd.to_datetime(df["actual_delivery_time"], errors="coerce")
df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
# Recalculate duration in case of new NaNs
df["actual_total_delivery_duration"] = (df["actual_delivery_time"] - df["created_at"]).dt.total_seconds()
# Convert datetime to a numeric feature (Unix timestamp in seconds)
df["delivery_time_timestamp"] = df["actual_delivery_time"].astype("int64") // 10**9
Good. Now that the dtypes are numeric, let’s apply the ML model.
from sklearn.linear_model import LinearRegression
# Use the new numeric timestamp feature
X = df[["delivery_time_timestamp", "estimated_order_place_duration"]]
y = df["actual_total_delivery_duration"]
# Drop any remaining NaNs from our feature set and target
X_clean = X.dropna()
y_clean = y.loc[X_clean.index].dropna()
X_clean = X_clean.loc[y_clean.index]
model = LinearRegression()
model.fit(X_clean, y_clean)
print("✅ Model trained successfully!")
Here is the output.
Great job!
Final Thoughts: Debug Smarter, Not Harder
Model crashes don’t always stem from complex bugs — sometimes, it’s just a stray NaN or an unconverted date column sneaking into your data pipeline.
Rather than wrestling with cryptic stack traces or tossing try-except blocks like darts in the dark, dig into your DataFrame early. Peek at .info(), check .isna().sum(), and don’t shy away from .dtypes. These simple steps unveil hidden landmines before you even hit fit().
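If you want those checks in one place, here is a minimal pre-fit sanity check you could adapt; the exact prints and column handling are assumptions, not a fixed recipe:

def sanity_check(X, y):
    """Print quick diagnostics before calling model.fit()."""
    print("Shapes:", X.shape, y.shape)
    print("Feature dtypes:\n", X.dtypes)
    print("NaNs in X:\n", X.isna().sum()[X.isna().sum() > 0])
    print("NaNs in y:", y.isna().sum())
    non_numeric = X.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        print("Non-numeric columns to fix before fitting:", non_numeric)

sanity_check(X_clean, y_clean)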
I’ve shown you that even one overlooked object type or a sneaky missing value can sabotage a model. But with a sharper eye, cleaner prep, and intentional feature extraction, you’ll shift from debugging reactively to building intelligently.
Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.