Finding Heavy Traffic Indicators on I-94

In this project walkthrough, we’ll explore how to use data visualization techniques to uncover traffic patterns on Interstate 94, one of America’s busiest highways. By analyzing real-world traffic volume data along with weather conditions and time-based factors, we’ll identify key indicators of heavy traffic that could help commuters plan their travel times more effectively.

Traffic congestion is a daily challenge for millions of commuters. Understanding when and why heavy traffic occurs can help drivers make informed decisions about their travel times, and help city planners optimize traffic flow. Through this hands-on analysis, we’ll discover surprising patterns that go beyond the obvious rush-hour expectations.

Throughout this tutorial, we’ll build multiple visualizations that tell a comprehensive story about traffic patterns, demonstrating how exploratory data visualization can reveal insights that summary statistics alone might miss.

What You’ll Learn

By the end of this tutorial, you’ll know how to:

Create and interpret histograms to understand traffic volume distributions
Use time series visualizations to identify daily, weekly, and monthly patterns
Build side-by-side plots for effective comparisons
Analyze correlations between weather conditions and traffic volume
Apply grouping and aggregation techniques for time-based analysis
Combine multiple visualization types to tell a complete data story

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

Review the Project

Access the project and familiarize yourself with the goals and structure: Finding Heavy Traffic Indicators Project.

Access the Solution Notebook

You can view and download it here to see what we’ll be covering: Solution Notebook

Prepare Your Environment

- If you’re using the Dataquest platform, everything is already set up for you
- If working locally, ensure you have Python with pandas, matplotlib, and seaborn installed
- Download the dataset from the UCI Machine Learning Repository

Prerequisites

New to Markdown? We recommend learning the basics to format headers and add context to your Jupyter notebook: Markdown Guide.

Setting Up Your Environment

Let’s begin by importing the necessary libraries and loading our dataset:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The %matplotlib inline command is Jupyter magic that ensures our plots render directly in the notebook. This is essential for an interactive data exploration workflow.

traffic = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
traffic.head()

   holiday   temp  rain_1h  snow_1h  clouds_all weather_main  \
0      NaN  288.28      0.0      0.0          40       Clouds
1      NaN  289.36      0.0      0.0          75       Clouds
2      NaN  289.58      0.0      0.0          90       Clouds
3      NaN  290.13      0.0      0.0          90       Clouds
4      NaN  291.14      0.0      0.0          75       Clouds

      weather_description            date_time  traffic_volume
0      scattered clouds  2012-10-02 09:00:00            5545
1        broken clouds  2012-10-02 10:00:00            4516
2      overcast clouds  2012-10-02 11:00:00            4767
3      overcast clouds  2012-10-02 12:00:00            5026
4        broken clouds  2012-10-02 13:00:00            4918

Our dataset contains hourly traffic volume measurements from a station between Minneapolis and St. Paul on westbound I-94, along with weather conditions for each hour. Key columns include:

holiday: Name of holiday (if applicable)
temp: Temperature in Kelvin
rain_1h: Rainfall in mm for the hour
snow_1h: Snowfall in mm for the hour
clouds_all: Percentage of cloud cover
weather_main: General weather category
weather_description: Detailed weather description
date_time: Timestamp of the measurement
traffic_volume: Number of vehicles (our target variable)

Learning Insight: Notice the temperatures are in Kelvin (around 288K = 15°C = 59°F). This is unusual for everyday use but common in scientific datasets. When presenting findings to stakeholders, you might want to convert these to Fahrenheit or Celsius for better interpretability.
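If you do want friendlier units, the conversion is a one-liner per scale. Here is a minimal sketch on a made-up three-row sample (the `sample` frame and the `temp_c` / `temp_f` column names are my own, not part of the dataset); the same two lines should work on `traffic['temp']`.

```python
import pandas as pd

# Hypothetical mini-sample; in the project the column is traffic['temp'] (Kelvin).
sample = pd.DataFrame({'temp': [288.28, 273.15, 310.07]})

# Kelvin -> Celsius: subtract 273.15
sample['temp_c'] = sample['temp'] - 273.15

# Celsius -> Fahrenheit: multiply by 9/5, then add 32
sample['temp_f'] = sample['temp_c'] * 9 / 5 + 32

print(sample.round(2))
```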

Initial Data Exploration

Before diving into visualizations, let’s understand our dataset structure:

traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48204 entries, 0 to 48203
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   holiday              61 non-null     object
 1   temp                 48204 non-null  float64
 2   rain_1h              48204 non-null  float64
 3   snow_1h              48204 non-null  float64
 4   clouds_all           48204 non-null  int64
 5   weather_main         48204 non-null  object
 6   weather_description  48204 non-null  object
 7   date_time            48204 non-null  object
 8   traffic_volume       48204 non-null  int64
dtypes: float64(3), int64(2), object(4)
memory usage: 3.3+ MB

We have nearly 50,000 hourly observations spanning several years. Notice that the holiday column has only 61 non-null values out of 48,204 rows. Let’s investigate:

traffic['holiday'].value_counts()

holiday
Labor Day                    7
Christmas Day                6
Thanksgiving Day             6
Martin Luther King Jr Day    6
New Years Day                6
Veterans Day                 5
Columbus Day                 5
Memorial Day                 5
Washingtons Birthday         5
State Fair                   5
Independence Day             5
Name: count, dtype: int64

Learning Insight: At first glance, you might think the holiday column is nearly useless with so few values. But actually, holidays are only marked at midnight on the holiday itself. This is a great example of how understanding your data’s structure can make a big difference: what looks like missing data might actually be a deliberate design choice. For a complete analysis, you’d want to expand these holiday markers to cover all 24 hours of each holiday.
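One possible way to expand the markers is to look up each row’s calendar date in a small date-to-holiday mapping. This is only a sketch on made-up data: the `sample` frame and the `holiday_full` and `is_holiday` columns are names I’m inventing for illustration, and it assumes the holiday name always sits on a row belonging to the holiday’s own date.

```python
import pandas as pd

# Made-up two-day sample: the holiday name appears only on the midnight row.
hours = pd.Timestamp('2016-07-04') + pd.to_timedelta(list(range(48)), unit='h')
sample = pd.DataFrame({
    'date_time': hours,
    'holiday': ['Independence Day'] + [None] * 47,
})

# Build a {calendar date: holiday name} lookup from the rows that are marked.
marked = sample.dropna(subset=['holiday'])
holiday_by_date = dict(zip(marked['date_time'].dt.date, marked['holiday']))

# Map every row's calendar date through the lookup; unmarked dates become NaN.
sample['holiday_full'] = sample['date_time'].dt.date.map(holiday_by_date)
sample['is_holiday'] = sample['holiday_full'].notna()

print(sample['is_holiday'].sum())  # number of hourly rows flagged as holiday
```

On the real data you would apply the same three steps to `traffic` after converting `date_time` to datetime.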

Let’s examine our numeric variables:

traffic.describe()

              temp       rain_1h       snow_1h    clouds_all  traffic_volume
count  48204.000000  48204.000000  48204.000000  48204.000000    48204.000000
mean     281.205870      0.334264      0.000222     49.362231     3259.818355
std       13.338232     44.789133      0.008168     39.015750     1986.860670
min        0.000000      0.000000      0.000000      0.000000        0.000000
25%      272.160000      0.000000      0.000000      1.000000     1193.000000
50%      282.450000      0.000000      0.000000     64.000000     3380.000000
75%      291.806000      0.000000      0.000000     90.000000     4933.000000
max      310.070000   9831.300000      0.510000    100.000000     7280.000000

Key observations:

Temperature ranges from 0K to 310K (that 0K is suspicious and likely a data quality issue)
Most hours have no precipitation (75th percentile for both rain and snow is 0)
Traffic volume ranges from 0 to 7,280 vehicles per hour
The mean (3,260) and median (3,380) traffic volumes are similar, but that alone doesn’t tell us the shape of the distribution, so a plot is the next step
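Before moving on, it can help to flag rows like that 0K reading so they don’t go unnoticed. The sketch below uses made-up rows and two thresholds (`MIN_TEMP_K`, `MAX_RAIN_MM`) that I picked purely for illustration; choose limits that make sense for your own analysis.

```python
import pandas as pd

# Made-up rows that mimic the extremes seen in describe().
sample = pd.DataFrame({
    'temp': [288.28, 0.0, 291.14],
    'rain_1h': [0.0, 0.25, 9831.3],
})

# Illustrative plausibility limits (assumptions, not values from the dataset docs).
MIN_TEMP_K = 200
MAX_RAIN_MM = 300

# A row is suspicious if either reading falls outside its limit.
suspicious = (sample['temp'] < MIN_TEMP_K) | (sample['rain_1h'] > MAX_RAIN_MM)

print(sample[suspicious])
```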

Visualizing Traffic Volume Distribution

Let’s create our first visualization to understand traffic patterns:

plt.hist(traffic["traffic_volume"])
plt.xlabel("Traffic Volume")
plt.ylabel("Frequency")
plt.title("Traffic Volume Distribution")
plt.show()

Traffic Distribution

Learning Insight: Always label your axes and add titles! Your audience shouldn’t have to guess what they’re looking at. A graph without context is just pretty colors.

The histogram reveals a striking bimodal distribution with two distinct peaks:

One peak near 0-1,000 vehicles (low traffic)
Another peak around 4,000-5,000 vehicles (high traffic)

This suggests two distinct traffic regimes. My immediate hypothesis: these correspond to day and night traffic patterns.

Day vs. Night Analysis

Let’s test our hypothesis by splitting the data into day and night periods:

# Convert date_time to datetime format
traffic['date_time'] = pd.to_datetime(traffic['date_time'])

# Create day and night dataframes
# Filter first, then copy, so later column assignments act on an independent frame
day = traffic[(traffic['date_time'].dt.hour >= 7) &
              (traffic['date_time'].dt.hour < 19)].copy()

night = traffic[(traffic['date_time'].dt.hour >= 19) |
                (traffic['date_time'].dt.hour < 7)].copy()

Learning Insight: I chose 7 AM to 7 PM as “day” hours, which gives us equal 12-hour periods. This is somewhat arbitrary, and you might define daytime differently. I encourage you to experiment with different definitions, like 6 AM to 6 PM, and see how it affects your results. Just keep the periods balanced to avoid skewing your analysis.
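To make that experiment easy, you could wrap the split in a small helper. `split_day_night` and its parameters are hypothetical names of mine, and the 24-row `sample` is made up; this is a sketch, not the notebook’s code.

```python
import pandas as pd

def split_day_night(df, day_start=7, day_end=19):
    """Return (day, night) copies of df, where day is day_start <= hour < day_end."""
    hours = df['date_time'].dt.hour
    is_day = (hours >= day_start) & (hours < day_end)
    return df[is_day].copy(), df[~is_day].copy()

# Made-up sample: one row per hour for a single day.
sample = pd.DataFrame({
    'date_time': pd.Timestamp('2012-10-02') + pd.to_timedelta(list(range(24)), unit='h'),
})

day_7, night_7 = split_day_night(sample)          # 7 AM to 7 PM
day_6, night_6 = split_day_night(sample, 6, 18)   # 6 AM to 6 PM

print(len(day_7), len(night_7))
```

Calling `split_day_night(traffic, 6, 18)` on the real data would let you rerun the histograms below with a different definition.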

Now let’s visualize both distributions side by side:

plt.figure(figsize=(11,3.5))

plt.subplot(1, 2, 1)
plt.hist(day['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Day')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')

plt.subplot(1, 2, 2)
plt.hist(night['traffic_volume'])
plt.xlim(-100, 7500)
plt.ylim(0, 8000)
plt.title('Traffic Volume: Night')
plt.ylabel('Frequency')
plt.xlabel('Traffic Volume')

plt.show()

Traffic by Day and Night

Perfect! The plots support our hypothesis. The low-traffic peak comes almost entirely from nighttime hours, while the high-traffic peak occurs during the daytime. Notice that I set the same axis limits for both plots, which ensures a fair visual comparison.

Let’s quantify this difference:

print(f"Day traffic mean: {day['traffic_volume'].mean():.0f} vehicles/hour")
print(f"Night traffic mean: {night['traffic_volume'].mean():.0f} vehicles/hour")

Day traffic mean: 4762 vehicles/hour
Night traffic mean: 1785 vehicles/hour

Day traffic is about 2.7 times night traffic on average!
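For the record, here is the arithmetic behind that comparison, using the two means printed above (the variable names are mine):

```python
# Means copied from the printed output above.
day_mean = 4762
night_mean = 1785

# How many times larger is the average daytime volume?
ratio = day_mean / night_mean

print(f"Day/night ratio: {ratio:.2f}")
```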

Monthly Traffic Patterns

Now let’s explore seasonal patterns by examining traffic by month:

day['month'] = day['date_time'].dt.month
by_month = day.groupby('month').mean(numeric_only=True)

plt.plot(by_month['traffic_volume'], marker='o')
plt.title('Traffic volume by month')
plt.xlabel('Month')
plt.ylabel('Average traffic volume')
plt.show()

Traffic by Month

The plot reveals:

Winter months (Jan, Feb, Nov, Dec) have notably lower traffic
A dramatic dip in July that seems anomalous

Let’s investigate that July anomaly:

day['year'] = day['date_time'].dt.year
only_july = day[day['month'] == 7]

plt.plot(only_july.groupby('year').mean(numeric_only=True)['traffic_volume'])
plt.title('July Traffic by Year')
plt.xlabel('Year')
plt.ylabel('Average traffic volume')
plt.show()

July Traffic by Year

Learning Insight: This is a perfect example of why exploratory visualization is so valuable. That July dip? It turns out I-94 was completely shut down for several days in July 2016. Those zero-traffic days pulled down the monthly average dramatically. This is a reminder that outliers can significantly impact means, so always investigate unusual patterns in your data!
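To see how much a few zero readings can drag a mean down, here is a tiny illustration with made-up volumes (not values from the dataset). Comparing the mean with the median is one quick check you might run alongside the plot.

```python
import pandas as pd

# Made-up hourly volumes: five ordinary readings plus three zeros from a closure.
volumes = pd.Series([4800, 4900, 4700, 5000, 4850, 0, 0, 0])

mean_volume = volumes.mean()      # pulled down by the zeros
median_volume = volumes.median()  # much less sensitive to them

print(mean_volume, median_volume)
```

On the real data, `day.groupby('month')['traffic_volume'].median()` would give the median-based version of the monthly plot.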

Day of Week Patterns

Let’s examine weekly patterns:

day['dayofweek'] = day['date_time'].dt.dayofweek
by_dayofweek = day.groupby('dayofweek').mean(numeric_only=True)

plt.plot(by_dayofweek['traffic_volume'])

# Add day labels for readability
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
plt.xticks(range(len(days)), days)
plt.xlabel('Day of Week')
plt.ylabel('Traffic Volume')
plt.title('Traffic by Day of Week')
plt.show()

Traffic by Day of Week

Clear pattern: weekday traffic is significantly higher than weekend traffic. This aligns with commuting patterns because most people drive to work Monday through Friday.
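If you’d like a single number for each group instead of reading it off the line plot, a boolean flag works well. This is a sketch on made-up noon readings for one week; `is_weekend` is a column name I’m introducing for illustration.

```python
import pandas as pd

# Made-up sample: one noon reading per day, starting Monday 2012-10-01.
sample = pd.DataFrame({
    'date_time': pd.Timestamp('2012-10-01 12:00') + pd.to_timedelta(list(range(7)), unit='D'),
    'traffic_volume': [4900, 5200, 5300, 5300, 5300, 3900, 3400],
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
sample['is_weekend'] = sample['date_time'].dt.dayofweek >= 5

# Average volume for weekdays (False) and weekends (True).
summary = sample.groupby('is_weekend')['traffic_volume'].mean()

print(summary)
```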

Hourly Patterns: Weekday vs. Weekend

Let’s dig deeper into hourly patterns, comparing business days to weekends:

day['hour'] = day['date_time'].dt.hour
business_days = day[day['dayofweek'] <= 4].copy()  # Monday-Friday
weekend = day[day['dayofweek'] >= 5].copy()        # Saturday-Sunday

by_hour_business = business_days.groupby('hour').mean(numeric_only=True)
by_hour_weekend = weekend.groupby('hour').mean(numeric_only=True)

plt.figure(figsize=(11,3.5))

plt.subplot(1, 2, 1)
plt.plot(by_hour_business['traffic_volume'])
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Monday–Friday')

plt.subplot(1, 2, 2)
plt.plot(by_hour_weekend['traffic_volume'])
plt.xlim(6,20)
plt.ylim(1500,6500)
plt.title('Traffic Volume By Hour: Weekend')

plt.show()

Traffic by Hour

The patterns are strikingly different:

Weekdays: Clear morning (7 AM) and evening (4-5 PM) rush hour peaks
Weekends: Gradual increase through the day with no distinct peaks
Quietest daytime hour on weekdays: around 10 AM (between the rush hours)
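Rather than eyeballing the curve, you can ask pandas for the busiest and quietest hours directly. The numbers below are made up to resemble the weekday shape; on the real data you would use `by_hour_business['traffic_volume']` in place of `by_hour`.

```python
import pandas as pd

# Made-up hourly averages shaped like the weekday curve (hour -> vehicles/hour).
by_hour = pd.Series({
    7: 6000, 8: 5500, 9: 4900, 10: 4400, 11: 4600, 12: 4850,
    13: 4850, 14: 5150, 15: 5600, 16: 6200, 17: 5800, 18: 4450,
})

quietest_hour = by_hour.idxmin()  # index label of the smallest value
busiest_hour = by_hour.idxmax()   # index label of the largest value

print(quietest_hour, busiest_hour)
```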

Weather Impact Analysis

Now let’s explore whether weather conditions affect traffic:

weather_cols = ['clouds_all', 'snow_1h', 'rain_1h', 'temp', 'traffic_volume']
correlations = day[weather_cols].corr()['traffic_volume'].sort_values()
print(correlations)

clouds_all       -0.032932
snow_1h           0.001265
rain_1h           0.003697
temp              0.128317
traffic_volume    1.000000
Name: traffic_volume, dtype: float64

Surprisingly weak correlations! Weather doesn’t seem to have a strong linear relationship with traffic volume. Temperature shows the strongest correlation, with a coefficient of only about 0.13.
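One caveat: `.corr()` defaults to Pearson correlation, which only measures linear association. A rank-based (Spearman) correlation is a reasonable cross-check, and you can get it by correlating the ranks. The toy data below is made up to show the difference; it says nothing about the traffic data itself.

```python
import pandas as pd

# Made-up data with a perfectly monotonic but non-linear relationship.
sample = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6]})
sample['y'] = sample['x'] ** 3

# Pearson: strength of the *linear* relationship.
pearson = sample['x'].corr(sample['y'])

# Spearman: Pearson correlation computed on the ranks of each column.
spearman = sample['x'].rank().corr(sample['y'].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

For the project, `day[weather_cols].rank().corr()['traffic_volume']` would give the rank-based counterpart of the table above.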

Let’s visualize this with a scatter plot:

plt.figure(figsize=(10,6))
sns.scatterplot(x='traffic_volume', y='temp', hue='dayofweek', data=day)
plt.ylim(230, 320)
plt.show()

Traffic Volume vs. Temperature

Learning Insight: When I first created this scatter plot, I got excited seeing distinct clusters. Then I realized the colors just correspond to our earlier finding—weekends (darker colors) have lower traffic. This is a reminder to always think critically about what patterns actually mean, not just that they exist!

Let’s examine specific weather conditions:

by_weather_main = day.groupby('weather_main').mean(numeric_only=True).sort_values('traffic_volume')

plt.barh(by_weather_main.index, by_weather_main['traffic_volume'])
plt.axvline(x=5000, linestyle="--", color="k")
plt.xlabel('Average traffic volume')
plt.title('Traffic Volume by Weather Type')
plt.show()

Traffic Volume by Weather Type

Learning Insight: This is a critical lesson in data analysis: always check your sample sizes! Those weather conditions with seemingly high traffic volumes? They have only 1-4 data points each. You can’t draw reliable conclusions from such small samples. The most common weather conditions (clear skies, scattered clouds) have thousands of data points and show average traffic levels.
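A simple habit that catches this is to compute the count alongside the mean and filter on it. The rows below are made up, and `MIN_ROWS` is an arbitrary cutoff I chose for the example.

```python
import pandas as pd

# Made-up sample: two common conditions and one that appears only once.
sample = pd.DataFrame({
    'weather_description': ['sky is clear'] * 5 + ['scattered clouds'] * 4 + ['shower snow'],
    'traffic_volume': [4700, 4800, 4600, 4900, 4750, 4800, 4700, 4900, 4600, 5600],
})

# Mean and number of rows per condition, side by side.
stats = sample.groupby('weather_description')['traffic_volume'].agg(['mean', 'count'])

# Keep only conditions with enough observations (illustrative cutoff).
MIN_ROWS = 3
reliable = stats[stats['count'] >= MIN_ROWS]

print(stats)
print(reliable)
```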

Key Findings and Conclusions

Through our exploratory visualization, we’ve discovered:

Time-Based Indicators of Heavy Traffic:

Day vs. Night: Daytime (7 AM – 7 PM) averages about 2.7x the traffic of nighttime
Day of Week: Weekdays have significantly more traffic than weekends
Rush Hours: 7-8 AM and 4-5 PM on weekdays show highest volumes
Seasonal: Winter months (Jan, Feb, Nov, Dec) have lower traffic volumes

Weather Impact:

Surprisingly minimal correlation between weather and traffic volume
Temperature shows a weak positive correlation (coefficient of about 0.13)
Rain and snow show almost no correlation
This suggests commuters drive regardless of weather conditions

Best Times to Travel:

Avoid: Weekday rush hours (7-8 AM, 4-5 PM)
Optimal: Weekends, nights, or mid-day on weekdays (around 10 AM)

Next Steps

To extend this analysis, consider:

Holiday Analysis: Expand holiday markers to cover all 24 hours and analyze holiday traffic patterns
Weather Persistence: Do consecutive hours of rain/snow affect traffic differently?
Outlier Investigation: Deep dive into the July 2016 shutdown and other anomalies
Predictive Modeling: Build a model to forecast traffic volume based on time and weather
Directional Analysis: Compare eastbound vs. westbound traffic patterns
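As a starting point for the weather persistence idea, here is one way you might count how many hours in a row it has been raining. It’s a sketch on made-up readings, `rain_streak` is my own column name, and it assumes the rows are consecutive hours (check the real data for gaps or duplicate timestamps before relying on it).

```python
import pandas as pd

# Made-up hourly rainfall readings in mm.
sample = pd.DataFrame({'rain_1h': [0.0, 0.3, 1.2, 0.5, 0.0, 0.0, 2.0, 0.1]})

raining = sample['rain_1h'] > 0

# Every dry hour starts a new block, so the running count of dry hours
# works as a block id that stays constant during a rainy spell.
block_id = (~raining).cumsum()

# Within each block, the running count of rainy hours is the streak length.
sample['rain_streak'] = raining.astype(int).groupby(block_id).cumsum()

print(sample)
```

From there you could compare average `traffic_volume` across streak lengths, keeping the sample-size caveat from earlier in mind.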

This project perfectly demonstrates the power of exploratory visualization. We started with a simple question, “What causes heavy traffic?”, and through systematic visualization uncovered clear patterns. The weather findings surprised me; I expected rain and snow to significantly impact traffic. This reminds us to let data challenge our assumptions!

Pretty graphs are nice, but they’re not the point. The real value of exploratory data analysis comes when you dig deep enough to actually understand what’s happening in your data, so that you can make smart decisions based on what you find. Whether you’re a commuter planning your route or a city planner optimizing traffic flow, these insights provide actionable intelligence.

If you give this project a go, please share your findings in the Dataquest community and tag me (@Anna_Strahl). I’d love to see what patterns you discover!

Happy analyzing!