
Integrating DuckDB & Python: An Analytics Guide



 

DuckDB is a fast, in-process analytical database designed for modern data analysis. It runs directly from your Python script, which means that there is no separate server needed, and it excels at complex queries thanks to its columnar storage and vectorized execution.

As knowing how to work with data becomes ever more important, today I want to show you how to build a Python workflow with DuckDB and explore its key features.

Let’s dive in!

 

What Is DuckDB?

 
DuckDB is a free, open-source, in-process OLAP database built for fast, local analytics. Unlike traditional databases that run as external services, DuckDB runs inside your application, with no server required. As an OLAP system, DuckDB stores data in columns (not rows like OLTP systems), making it highly efficient for analytical queries such as joins, aggregations, and groupings.

Think of DuckDB as a lightweight, analytics-optimized version of SQLite, bringing the simplicity of local databases together with the power of modern data warehousing. And this leads us to the following natural question…

 

What Are DuckDB’s Main Features?

 

Blazing-Fast Analytical Queries

DuckDB delivers impressive performance for OLAP workloads, often surprising users familiar with traditional databases like PostgreSQL. Unlike row-oriented engines, which can be sluggish when scanning large volumes of data, DuckDB leverages a columnar, vectorized execution engine. This design optimizes CPU cache usage and significantly accelerates analytical queries.

 

Native SQL Support + Seamless Language Integration

DuckDB offers full support for complex SQL queries and exposes APIs in multiple languages, including Java, C, and C++. Its tight integration with Python and R makes it ideal for interactive data analysis. You can write queries directly in your preferred environment, with extra SQL syntax enhancements (e.g., EXCLUDE, REPLACE, and GROUP BY ALL) to simplify query writing.

And the best part is that DuckDB is completely self-contained, with no external dependencies or setup headaches.

 

Free And Open Source

DuckDB is fully open-source and actively maintained by a growing community of contributors. This ensures rapid feature development and bug fixes. And yes, it’s free to use. While future licensing changes are always a possibility, for now, you get a powerful analytics engine at zero cost.

Now that we know its main features, let’s get started with it!

 

Getting Started With DuckDB

 
The installation process for DuckDB depends slightly on your environment, but overall, it’s quick and simple. Since DuckDB is an embedded database engine with no server requirements or external dependencies, setup typically takes just a few lines of code. You can find the complete installation guide in the official DuckDB documentation.

 

Prerequisites

Before diving in, ensure you have the following:

  • Python 3.9 or later installed
  • A basic understanding of SQL and data analysis in Python

You can easily install DuckDB in your environment by executing the following command:
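DuckDB's Python package is published on PyPI, so a standard pip install is all that's needed:

```shell
pip install duckdb
```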

 

Working With DuckDB in Python

Once you’ve installed DuckDB, it’s pretty simple to get started. You simply import DuckDB into your environment, then connect to an existing database or create a new one if required.

For example:

import duckdb 
connection = duckdb.connect()

 

If no database file is provided to the connect() method, DuckDB will create a new in-memory database by default. That said, the simplest way to start running SQL queries is by using the sql() method directly.

# Source: Basic API usage - https://duckdb.org/docs/api/python/overview.html
import duckdb
duckdb.sql('SELECT 42').show()

 

Running this command initializes a global in-memory DuckDB instance within the Python module and returns a relation, a symbolic representation of the query.

Importantly, the query itself isn’t executed until you explicitly request the result, as shown below:

# Source: Execute SQL - https://duckdb.org/docs/guides/python/execute_sql.html
results = duckdb.sql('SELECT 42').fetchall()
print(results)

"""
[(42,)]
"""

 

Let’s now work with some real data. DuckDB supports a wide range of file formats, including CSV, JSON, and Parquet, and loading them is simple.

You can see how straightforward it is in the example below:

# Source: Python API - https://duckdb.org/docs/api/python/overview.html 
import duckdb
duckdb.read_csv('example.csv')             # read a CSV file into a Relation
duckdb.read_parquet('example.parquet')     # read a Parquet file into a Relation
duckdb.read_json('example.json')           # read a JSON file into a Relation
duckdb.sql('SELECT * FROM "example.csv"')  # directly query a CSV file

 

Working With External Data Sources In DuckDB

 
One of DuckDB’s standout features is its ability to query external data files directly, without needing to import them into a database or load entire datasets into memory. Unlike traditional databases that require data to be ingested first, DuckDB supports a “zero-copy” execution model, allowing it to read only the data required for a given query.

This approach brings several key advantages:

  • Minimal memory usage: Only the relevant portions of the file are read into memory.
  • No import/export overhead: Query your data in place—no need to move or duplicate it.
  • Streamlined workflows: Easily query across multiple files and formats using a single SQL statement.

To demonstrate DuckDB in practice, we will use a simple CSV file of monthly passenger traffic, which you can download from Kaggle.

To query the data, we define a simple query that points to our file path:

import duckdb

source = "monthly_passengers.csv"  # illustrative path to the downloaded CSV

# Query data directly from the CSV file
result = duckdb.query(f"SELECT * FROM '{source}'").fetchall()
print(result)

 

Now we can easily handle data using SQL-like logic directly with DuckDB.

 

Filtering Rows

To focus on specific subsets of data, use the WHERE clause in DuckDB. It filters rows based on conditions using comparison operators (>, <, =, <>, etc.) and logical operators (AND, OR, NOT) for more complex expressions.


# Select only months with more than 500 passengers
result = duckdb.query(f"SELECT * FROM '{source}' WHERE total_passengers > 500").fetchall()
print(result)

 

Sorting Results

Use the ORDER BY clause to sort results by one or more columns. It defaults to ascending (ASC), but you can specify descending (DESC). To sort by multiple columns, separate them with commas.

# Sort months by number of passengers
sorted_result = duckdb.query(f"SELECT * FROM '{source}' ORDER BY total_passengers DESC").fetchall()
print("\nMonths sorted by total traffic:")
print(sorted_result)

 

Adding Calculated Columns

Create new columns in your query using expressions and the AS keyword. Use arithmetic operators or built-in functions to transform data—these columns appear in the results but don’t affect the original file.

# Express traffic in thousands of passengers
scaled_result = duckdb.query(f"""
   SELECT
       month,
       total_passengers,
       total_passengers/1000 AS traffic_in_thousands
   FROM '{source}'
""").fetchall()
print("\nTraffic in thousands of passengers:")
print(scaled_result)

 

Using CASE Expressions

For more complex transformations, SQL provides the CASE expression. This works similarly to if-else statements in programming languages, allowing you to apply conditional logic in your queries.

segmented_result = duckdb.query(f"""
   SELECT
       month,
       total_passengers,
       CASE
           WHEN total_passengers >= 100 THEN 'HIGH'
           WHEN total_passengers >= 50 THEN 'MEDIUM'
           ELSE 'LOW'
       END AS traffic_level
   FROM '{source}'
""").fetchall()
print("\nMonths by traffic level:")
print(segmented_result)

 

Conclusion

 
DuckDB is a high-performance OLAP database built for data professionals who need to explore and analyze large datasets efficiently. Its in-process SQL engine runs complex analytical queries directly in your environment — no separate server is required. With seamless support for Python, R, Java, C++, and more, DuckDB fits naturally into your existing workflow, no matter your preferred language.

You can find the full code in the accompanying GitHub repository.
 
 

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.
