Star Wars Survey Analysis Using Python and Pandas

In this project walkthrough, we’ll explore how to clean and analyze real survey data using Python and pandas, while diving into the fascinating world of Star Wars fandom. By working with survey results from FiveThirtyEight, we’ll uncover insights about viewer preferences, film rankings, and demographic trends that go beyond the obvious.

Survey data analysis is a critical skill for any data analyst. Unlike clean, structured datasets, survey responses come with unique challenges: inconsistent formatting, mixed data types, checkbox responses that need strategic handling, and missing values that tell their own story. This project tackles these real-world challenges head-on, preparing you for the messy datasets you’ll encounter in your career.

Throughout this tutorial, we’ll build professional-quality visualizations that tell a compelling story about Star Wars fandom, demonstrating how proper data cleaning and thoughtful visualization design can transform raw survey data into stakeholder-ready insights.

Why This Project Matters

Survey analysis represents a core data science skill applicable across industries. Whether you’re analyzing customer satisfaction surveys, employee engagement data, or market research, the techniques demonstrated here form the foundation of professional data analysis:

Data cleaning proficiency for handling messy, real-world datasets
Boolean conversion techniques for survey checkbox responses
Demographic segmentation analysis for uncovering group differences
Professional visualization design for stakeholder presentations
Insight synthesis for translating data findings into business intelligence

The Star Wars theme makes learning enjoyable, but these skills transfer directly to business contexts. Master these techniques, and you’ll be prepared to extract meaningful insights from any survey dataset that crosses your desk.

By the end of this tutorial, you’ll know how to:

Clean messy survey data by mapping yes/no columns and converting checkbox responses
Handle unnamed columns and create meaningful column names for analysis
Use boolean mapping techniques to avoid data corruption when re-running Jupyter cells
Calculate summary statistics and rankings from survey responses
Create professional-looking horizontal bar charts with custom styling
Build side-by-side comparative visualizations for demographic analysis
Apply object-oriented Matplotlib for precise control over chart appearance
Present clear, actionable insights to stakeholders

Before You Start: Pre-Instruction

To make the most of this project walkthrough, follow these preparatory steps:

Review the Project

Access the project and familiarize yourself with the goals and structure: Star Wars Survey Project

Access the Solution Notebook

You can view and download it here to see what we’ll be covering: Solution Notebook

Prepare Your Environment

If you’re using the Dataquest platform, everything is already set up for you
If working locally, ensure you have Python with pandas, matplotlib, and numpy installed
Download the dataset from the FiveThirtyEight GitHub repository

Prerequisites

Comfortable with Python basics and pandas DataFrames
Familiarity with dictionaries, loops, and methods in Python
Basic understanding of Matplotlib (we’ll use intermediate techniques)
Understanding of survey data structure is helpful, but not required

New to Markdown? We recommend learning the basics to format headers and add context to your Jupyter notebook: Markdown Guide.

Setting Up Your Environment

Let’s begin by importing the necessary libraries and loading our dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

The %matplotlib inline command is a Jupyter magic command that renders plots directly in the notebook. Recent Jupyter versions usually do this by default, but including it makes the behavior explicit and supports an interactive data exploration workflow.

star_wars = pd.read_csv("star_wars.csv")
star_wars.head()

Our dataset contains survey responses from over 1,100 respondents about their Star Wars viewing habits and preferences.

Learning Insight: Notice the unnamed columns (Unnamed: 4, Unnamed: 5, etc.) and extremely long column names? This is typical of survey data exported from platforms like SurveyMonkey. The unnamed columns actually represent different movies in the franchise, and cleaning these will be our first major task.
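To see where names like Unnamed: 4 come from, here is a minimal sketch using a tiny made-up CSV. The header text and rows are illustrative assumptions, not taken from the real file. When a header cell is blank, pandas fills in a placeholder name based on the column's position:

```python
import io
import pandas as pd

# Made-up miniature export: one checkbox question spread over three
# columns, with only the first column carrying the question text.
raw_csv = (
    "RespondentID,Which films have you seen?,,\n"
    "1,Film A,Film B,\n"
    "2,,Film B,Film C\n"
)

toy = pd.read_csv(io.StringIO(raw_csv))

# Blank header cells get positional placeholder names from pandas.
print(list(toy.columns))
```

The same pattern explains the real dataset: only the first column of each multi-part question keeps the question text, and the rest become Unnamed columns.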

The Data Challenge: Survey Structure Explained

Survey data presents unique structural challenges. Consider this typical survey question:

“Which of the following Star Wars films have you seen? Please select all that apply.”

This checkbox-style question gets exported as multiple columns where:

Column 1 contains “Star Wars: Episode I The Phantom Menace” if selected, NaN if not
Column 2 contains “Star Wars: Episode II Attack of the Clones” if selected, NaN if not
And so on for all six films…

This structure makes analysis difficult, so we’ll transform it into clean boolean columns.
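As a rough illustration of that structure, the sketch below builds a small made-up checkbox export and converts it with notna(). This is an alternative shortcut shown only to make the idea concrete; the walkthrough itself uses an explicit mapping dictionary in Step 2. The column names and film titles here are placeholders.

```python
import numpy as np
import pandas as pd

# Illustrative checkbox export: each cell holds the film title if the
# box was ticked and NaN otherwise (titles shortened for readability).
checkbox = pd.DataFrame({
    "seen_a": ["Film A", np.nan, "Film A"],
    "seen_b": [np.nan, np.nan, "Film B"],
})

# A cell counts as "seen" exactly when it is not missing.
as_bool = checkbox.notna()

# Summing booleans counts the True values in each column.
print(as_bool.sum())
```

One caveat with this shortcut: running notna() a second time on the already-converted columns would mark every cell True, since False is not a missing value.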

Data Cleaning Process

Step 1: Converting Yes/No Responses to Booleans

Survey responses often come as text (“Yes”/“No”), but boolean values (True/False) are much easier to work with programmatically:

yes_no = {"Yes": True, "No": False, True: True, False: False}

for col in [
    "Have you seen any of the 6 films in the Star Wars franchise?",
    "Do you consider yourself to be a fan of the Star Wars film franchise?",
    "Are you familiar with the Expanded Universe?",
    "Do you consider yourself to be a fan of the Star Trek franchise?"
]:
    star_wars[col] = star_wars[col].map(yes_no, na_action='ignore')

Learning Insight: Why the seemingly redundant True: True, False: False entries? They make the cell safe to re-run. Series.map() returns NaN for any value that is not a key in the dictionary, so without these entries, running the cell a second time would turn every True and False into NaN. This is a common Jupyter pitfall that can silently destroy your analysis!
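The pitfall is easy to reproduce on a small made-up Series. In this sketch, the second call to map() stands in for running the notebook cell again; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

answers = pd.Series(["Yes", "No", np.nan, "Yes"])

naive = {"Yes": True, "No": False}
rerun_safe = {"Yes": True, "No": False, True: True, False: False}

# First pass: both dictionaries give the same result.
naive_once = answers.map(naive, na_action="ignore")
safe_once = answers.map(rerun_safe, na_action="ignore")

# Second pass simulates running the notebook cell again.
naive_twice = naive_once.map(naive, na_action="ignore")
safe_twice = safe_once.map(rerun_safe, na_action="ignore")

print(naive_twice.isna().sum())  # every value is now missing
print(safe_twice.isna().sum())   # only the original NaN is missing
```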

Step 2: Transforming Movie Viewing Data

The trickiest part involves converting the checkbox movie data. Each of the six columns at positions 3 through 8 (the first named after the question, the rest unnamed) records whether someone has seen a specific Star Wars episode:

movie_mapping = {
    "Star Wars: Episode I  The Phantom Menace": True,
    np.nan: False,
    "Star Wars: Episode II  Attack of the Clones": True,
    "Star Wars: Episode III  Revenge of the Sith": True,
    "Star Wars: Episode IV  A New Hope": True,
    "Star Wars: Episode V The Empire Strikes Back": True,
    "Star Wars: Episode VI Return of the Jedi": True,
    True: True,
    False: False
}

for col in star_wars.columns[3:9]:
    star_wars[col] = star_wars[col].map(movie_mapping)
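Because map() turns any value missing from the dictionary into NaN, the keys must match the cell text exactly, including spacing. A small check like the one below can catch mismatches before they silently become missing values. The helper name unmapped_values and the sample data are hypothetical, written only for this sketch.

```python
import numpy as np
import pandas as pd

def unmapped_values(series, mapping):
    """Return the non-missing values in `series` that `mapping` lacks."""
    return {value for value in series.dropna().unique() if value not in mapping}

# Hypothetical column whose title uses a single space where the
# mapping key uses two, so the two strings do not match.
column = pd.Series(["Episode I The Phantom Menace", np.nan])
mapping = {"Episode I  The Phantom Menace": True, np.nan: False}

print(unmapped_values(column, mapping))
```

An empty set means every non-missing value in the column has a matching key, so the mapping is safe to apply.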

Step 3: Strategic Column Renaming

Long, unwieldy column names make analysis difficult. We’ll rename them to something manageable:

star_wars = star_wars.rename(columns={
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_1",
    "Unnamed: 4": "seen_2",
    "Unnamed: 5": "seen_3",
    "Unnamed: 6": "seen_4",
    "Unnamed: 7": "seen_5",
    "Unnamed: 8": "seen_6"
})

We’ll also clean up the ranking columns:

star_wars = star_wars.rename(columns={
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "ranking_ep1",
    "Unnamed: 10": "ranking_ep2",
    "Unnamed: 11": "ranking_ep3",
    "Unnamed: 12": "ranking_ep4",
    "Unnamed: 13": "ranking_ep5",
    "Unnamed: 14": "ranking_ep6"
})
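When the new names follow a numbered pattern, the rename dictionary can also be generated rather than typed out. The following is a minimal sketch on a toy frame with made-up column names, not a replacement for the explicit dictionaries above:

```python
import pandas as pd

# Toy frame standing in for the survey; the column names are made up.
toy = pd.DataFrame(
    columns=["id", "Which films have you seen?", "Unnamed: 2", "Unnamed: 3"]
)

# Pair each checkbox column (positions 1-3 here) with a numbered name.
checkbox_cols = toy.columns[1:4]
new_names = {old: f"seen_{i}" for i, old in enumerate(checkbox_cols, start=1)}

toy = toy.rename(columns=new_names)
print(list(toy.columns))
```

The explicit version is easier to audit, while the generated version is shorter and avoids typos in long question text. Either way, it helps to print the resulting column list and confirm the positions line up.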

Analysis: Uncovering the Data Story

Which Movie Reigns Supreme?

Let’s calculate the average ranking for each movie. Remember, in ranking questions, lower numbers indicate higher preference:

mean_ranking = star_wars[star_wars.columns[9:15]].mean().sort_values()
print(mean_ranking)

ranking_ep5    2.513158
ranking_ep6    3.047847
ranking_ep4    3.272727
ranking_ep1    3.732934
ranking_ep2    4.087321
ranking_ep3    4.341317

The results are clear: Episode V (The Empire Strikes Back) is the fan favorite with an average ranking of 2.51. Every film in the original trilogy (Episodes IV-VI) has a better (lower) average ranking than every film in the prequel trilogy (Episodes I-III).
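One detail worth keeping in mind is that mean() skips missing values, so each average reflects only the respondents who actually ranked that film. A small sketch on made-up rankings shows how to report the number of responses alongside the mean:

```python
import numpy as np
import pandas as pd

# Made-up rankings from four respondents; NaN marks a skipped question.
rankings = pd.DataFrame({
    "ranking_a": [1, 2, np.nan, 1],
    "ranking_b": [2, 1, np.nan, 3],
    "ranking_c": [3, 3, np.nan, 2],
})

summary = pd.DataFrame({
    "mean_rank": rankings.mean(),      # NaN values are skipped by default
    "n_responses": rankings.count(),   # non-missing answers per column
}).sort_values("mean_rank")

print(summary)
```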

Movie Viewership Patterns

Which movies have people actually seen?

total_seen = star_wars[star_wars.columns[3:9]].sum()
print(total_seen)

seen_1    673
seen_2    571
seen_3    550
seen_4    607
seen_5    758
seen_6    738

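Episodes V and VI lead in viewership, while Episodes II and III trail. Episode III has the fewest viewers at 550 respondents. Note that the split is not a clean original-versus-prequel divide: Episode I (673) was seen by more respondents than Episode IV (607).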
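Raw counts are easier to interpret next to a percentage. Since the cleaned columns are booleans, sum() gives the count and mean() gives the share of respondents. Here is a minimal sketch with made-up data:

```python
import pandas as pd

# Made-up boolean "seen" columns for five respondents.
seen = pd.DataFrame({
    "seen_a": [True, True, False, True, False],
    "seen_b": [True, False, False, False, False],
})

counts = seen.sum()          # number of respondents who saw each film
share = seen.mean() * 100    # percentage of all respondents

print(pd.DataFrame({"count": counts, "percent": share}))
```

This works because the cleaning step mapped missing checkbox cells to False, so the denominator is every respondent in the frame.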

Professional Visualization: From Basic to Stakeholder-Ready

Creating Our First Chart

Let’s start with a basic visualization and progressively enhance it:

plt.bar(range(6), star_wars[star_wars.columns[3:9]].sum())

This creates a functional chart, but it’s not ready for stakeholders. Let’s upgrade to object-oriented Matplotlib for precise control:

fig, ax = plt.subplots(figsize=(6,3))
rankings = ax.barh(mean_ranking.index, mean_ranking, color='#fe9b00')

ax.set_facecolor('#fff4d6')
ax.set_title('Average Ranking of Each Movie')

for spine in ['top', 'right', 'bottom', 'left']:
    ax.spines[spine].set_visible(False)

ax.invert_yaxis()
ax.text(2.6, 0.35, '*Lowest rank is the most\n liked', fontstyle='italic')

plt.show()

Learning Insight: Think of fig as your canvas and ax as a panel or chart area on that canvas. Object-oriented Matplotlib might seem intimidating initially, but it provides precise control over every visual element. The fig object handles overall figure properties while ax controls individual chart elements.
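The fig/ax split can be seen in a stripped-down example. The sketch below uses made-up values and assumes Matplotlib 3.4 or newer, where Axes.bar_label is available for printing values next to bars:

```python
import matplotlib
matplotlib.use("Agg")  # for running as a script; omit inside Jupyter
import matplotlib.pyplot as plt

# Made-up average rankings, already sorted from best to worst.
labels = ["film_a", "film_b", "film_c"]
values = [2.5, 3.1, 4.2]

fig, ax = plt.subplots(figsize=(6, 3))     # fig: whole canvas, ax: one chart
bars = ax.barh(labels, values, color="#fe9b00")

ax.invert_yaxis()                          # put the first (best) film on top
ax.bar_label(bars, fmt="%.1f", padding=3)  # print each value beside its bar
ax.set_title("Average Ranking (illustrative data)")

fig.tight_layout()                         # figure-level layout adjustment
```

Chart-level changes (title, bars, axis direction) go through ax, while layout of the whole canvas goes through fig.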

Advanced Visualization: Gender Comparison

Our most sophisticated visualization compares rankings and viewership by gender using side-by-side bars:

# Create gender-based dataframes
males = star_wars[star_wars["Gender"] == "Male"]
females = star_wars[star_wars["Gender"] == "Female"]

# Calculate statistics for each gender
male_ranking_avgs = males[males.columns[9:15]].mean()
female_ranking_avgs = females[females.columns[9:15]].mean()
male_tot_seen = males[males.columns[3:9]].sum()
female_tot_seen = females[females.columns[3:9]].sum()

# Create side-by-side comparison
ind = np.arange(6)
height = 0.35
offset = ind + height

fig, ax = plt.subplots(1, 2, figsize=(8,4))

# Rankings comparison
malebar = ax[0].barh(ind, male_ranking_avgs, color='#fe9b00', height=height)
femalebar = ax[0].barh(offset, female_ranking_avgs, color='#c94402', height=height)
ax[0].set_title('Movie Rankings by Gender')
ax[0].set_yticks(ind + height / 2)
ax[0].set_yticklabels(['Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6'])
ax[0].legend(['Men', 'Women'])

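# Viewership comparison
male2bar = ax[1].barh(ind, male_tot_seen, color='#ff1947', height=height)
female2bar = ax[1].barh(offset, female_tot_seen, color='#9b052d', height=height)
ax[1].set_title('# of Respondents by Gender')
ax[1].set_xlabel('Number of Respondents')
# Label the y-axis ticks so the bars in this panel can be matched to episodes
ax[1].set_yticks(ind + height / 2)
ax[1].set_yticklabels(['Episode 1', 'Episode 2', 'Episode 3', 'Episode 4', 'Episode 5', 'Episode 6'])
ax[1].legend(['Men', 'Women'])

# Keep the two panels and their tick labels from overlapping
fig.tight_layout()
plt.show()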

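Learning Insight: The offset technique (ind + height) is the key to creating side-by-side bars. It moves each female bar one bar-height along the y-axis so that it sits directly beside the matching male bar, and placing the tick labels at ind + height / 2 centers each label between the pair. Note that the two panels measure different things (average rank versus respondent counts), so compare bars within a panel rather than bar lengths across panels.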
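To make the offset arithmetic concrete, here is a small sketch with made-up values for two groups. It reads the bar positions back from Matplotlib to confirm that each pair is exactly one bar-height apart:

```python
import matplotlib
matplotlib.use("Agg")  # for running as a script; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

ind = np.arange(3)      # one base position per film
height = 0.35           # thickness of each bar

fig, ax = plt.subplots()
group_a = ax.barh(ind, [3, 4, 2], height=height, label="Group A")
group_b = ax.barh(ind + height, [2, 5, 3], height=height, label="Group B")

# Center the tick labels between the two bars of each pair.
ax.set_yticks(ind + height / 2)
ax.set_yticklabels(["film_a", "film_b", "film_c"])
ax.legend()

# Each Group B bar starts exactly one bar-height further along the y-axis.
gaps = [b.get_y() - a.get_y() for a, b in zip(group_a, group_b)]
print(gaps)
```

On a default y-axis, larger values are drawn higher, so the offset bars appear above their partners; calling ax.invert_yaxis() flips that order.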

Key Findings and Insights

Through our systematic analysis, we’ve discovered:

Movie Preferences:

Episode V (Empire Strikes Back) emerges as the fan favorite overall and among both men and women
The original trilogy outranks the prequels on average, and Episodes V and VI also lead in viewership; Episode IV is the exception, with fewer viewers (607) than Episode I (673)
Episode III receives the worst average ranking and has the fewest viewers

Gender Analysis:

Both men and women rank Episode V as their clear favorite
Gender differences in rankings are minimal; the clearer gap is in viewership, where men lead
Men tended to rank Episode IV slightly higher than women
More men have seen each of the six films than women, but the patterns remain consistent

Demographic Insights:

The ranking differences between genders are negligible across most films
Episodes V and VI represent the franchise’s most universally appealing content
Viewership counts lend some support to the stereotype that men engage more with sci-fi, but the film rankings of men and women remain remarkably similar
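Viewership comparisons between groups depend on how many respondents each group has, so raw counts are worth checking against shares. The sketch below uses made-up respondents (not the survey data) in which both groups have identical counts but different shares:

```python
import pandas as pd

# Made-up respondents; "seen_*" columns are booleans as in the cleaned data.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "seen_a": [True, True, True, False, True],
    "seen_b": [True, False, False, False, True],
})
seen_cols = ["seen_a", "seen_b"]

counts = toy.groupby("Gender")[seen_cols].sum()    # raw viewers per group
shares = toy.groupby("Gender")[seen_cols].mean()   # fraction of each group

# Average number of films seen per respondent in each group.
films_per_person = toy[seen_cols].sum(axis=1).groupby(toy["Gender"]).mean()

print(counts, shares, films_per_person, sep="\n\n")
```

Applying the same groupby to the cleaned survey frame would show whether the gap in counts also holds once group sizes are taken into account.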

The Stakeholder Summary

Every analysis should conclude with clear, actionable insights. Here’s what stakeholders need to know:

Episode V (Empire Strikes Back) is the fan favorite, with the lowest (best) average ranking overall and among both men and women
Gender differences in movie preferences are minimal, challenging common stereotypes about sci-fi preferences
The original trilogy outperforms the prequels in fan rankings, and Episodes V and VI have the widest audience reach
More male than female respondents reported seeing each film, which suggests higher engagement among men, though this analysis compares raw counts rather than per-respondent averages

Beyond This Analysis: Next Steps

This dataset contains rich additional dimensions worth exploring:

Character Analysis: Which characters are universally loved, hated, or controversial across the fanbase?
The “Han Shot First” Debate: Analyze this infamous Star Wars controversy and what it reveals about fandom
Cross-Franchise Preferences: Explore correlations between Star Wars and Star Trek fandom
Education and Age Correlations: Do viewing patterns vary by demographic factors beyond gender?
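As a possible starting point for the cross-franchise question, a cross-tabulation shows how often the two fan flags coincide. This is only a sketch on made-up data; the short column names are placeholders for the much longer survey questions.

```python
import pandas as pd

# Made-up boolean fan flags; the real column names are much longer.
toy = pd.DataFrame({
    "star_wars_fan": [True, True, True, False, False, True],
    "star_trek_fan": [True, False, True, False, True, True],
})

# Rows: Star Wars fan or not. Columns: Star Trek fan or not.
# normalize="index" turns each row into proportions that sum to 1.
table = pd.crosstab(toy["star_wars_fan"], toy["star_trek_fan"], normalize="index")
print(table)
```

Reading across a row answers questions like "among Star Wars fans, what share are also Star Trek fans?"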

This project perfectly balances technical skill development with engaging subject matter. You’ll emerge with a polished portfolio piece demonstrating data cleaning proficiency, advanced visualization capabilities, and the ability to transform messy survey data into actionable business insights.

Whether you’re team Jedi or Sith, the data tells a compelling story. And now you have the skills to tell it beautifully.

If you give this project a go, please share your findings in the Dataquest community and tag me (@Anna_Strahl). I’d love to see what patterns you discover!
