
Build Your Own Simple Data Pipeline with Python and Docker

Image by Author | Ideogram

 

Data is the asset that drives our work as data professionals. Without proper data, we cannot perform our tasks, and our business will fail to gain a competitive advantage. Thus, securing suitable data is crucial for any data professional, and data pipelines are the systems designed for this purpose.

Data pipelines are systems designed to move and transform data from one source to another. These systems are part of the overall infrastructure for any business that relies on data, as they guarantee that our data is reliable and always ready to use.

Building a data pipeline may sound complex, but a few simple tools are sufficient to create reliable data pipelines with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply in your everyday data work.

Let’s get into it.

 

Building the Data Pipeline

 
Before we build our data pipeline, let’s understand the concept of ETL, which stands for Extract, Transform, and Load. ETL is a process where the data pipeline performs the following actions:

  • Extract data from various sources. 
  • Transform data into a valid format. 
  • Load data into an accessible storage location.

ETL is a standard pattern for data pipelines, so what we build will follow this structure. 
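In code terms, ETL maps naturally to one function per stage. The sketch below shows only the generic shape of such a pipeline; the actual implementation we build later in this article follows the same structure:

def extract():
    # Read data from the source (a file, an API, a database).
    ...

def transform(raw_data):
    # Clean and reshape the data into a valid format.
    ...

def load(clean_data):
    # Write the result to an accessible storage location.
    ...

def run_pipeline():
    load(transform(extract()))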

With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the data pipeline application’s environment using containers.

Let’s set up our data pipeline with Python and Docker. 

 

Step 1: Preparation

First, we must ensure that we have Python and Docker installed on our system (we will not cover the installation here).

For our example, we will use the heart attack dataset from Kaggle as the data source to develop our ETL process.  

With everything in place, we will prepare the project structure. Overall, the simple data pipeline will have the following skeleton:

simple-data-pipeline/
├── app/
│   └── pipeline.py
├── data/
│   └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml

 

There is a main folder called simple-data-pipeline, which contains:

  • An app folder containing the pipeline.py file.
  • A data folder containing the source data (Medicaldataset.csv).
  • The requirements.txt file for environment dependencies.
  • The Dockerfile for the Docker configuration.
  • The docker-compose.yml file to define and run our multi-container Docker application.
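At this point, it is worth confirming that the dataset is readable from the data folder before wiring up the pipeline. The snippet below is an optional sanity check, assuming you have downloaded the Kaggle CSV and saved it as data/Medicaldataset.csv:

import pandas as pd

# Optional sanity check: confirm the downloaded CSV can be read from the data folder.
df = pd.read_csv("data/Medicaldataset.csv")

print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # raw column names before any cleaning
print(df.head())            # first few records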

We will first fill out the requirements.txt file, which contains the libraries required for our project.

In this case, we will only use the following library:

pandas

In the next section, we will set up the data pipeline using our sample data.

 

Step 2: Set up the Pipeline

We will set up the Python pipeline.py file for the ETL process. In our case, we will use the following code.

import pandas as pd
import os

# Paths inside the container; /data is mounted from the local data folder via docker-compose.
input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")

def extract_data(path):
    # Extract: read the raw CSV file into a DataFrame.
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    # Transform: drop rows with missing values and normalize the column names.
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    # Load: write the cleaned data to a new CSV file.
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    # Run the full ETL process end to end.
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()

 

The pipeline follows the ETL process: we extract the data by reading the CSV file, transform it by dropping rows with missing values and cleaning the column names, and load the cleaned data into a new CSV file. These steps are wrapped in a single run_pipeline function that executes the entire process.
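If you want to check the transformation logic before building any containers, you can exercise the transform_data function on a small in-memory DataFrame. This is a quick, optional sketch; it assumes you run it from inside the app folder so that pipeline.py is importable:

import pandas as pd

from pipeline import transform_data  # assumes this script runs from inside the app folder

# A tiny DataFrame with a messy column name and a missing value,
# mimicking the issues the transform step is meant to fix.
sample = pd.DataFrame({
    "Heart Rate": [72, None, 85],
    "Age": [45, 60, 52],
})

cleaned = transform_data(sample)
print(cleaned.columns.tolist())  # expected: ['heart_rate', 'age']
print(len(cleaned))              # expected: 2, since the row with the missing value is dropped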

 

Step 3: Set up the Dockerfile

With the Python pipeline file ready, we will fill in the Dockerfile to set up the configuration for the Docker container using the following code:

FROM python:3.10-slim

WORKDIR /app
COPY ./app /app
COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "pipeline.py"]

 

In the code above, we specify that the container will use Python version 3.10 as its environment. Next, we set the container’s working directory to /app and copy everything from our local app folder into the container’s app directory. We also copy the requirements.txt file and execute the pip installation within the container. Finally, we specify the command to run the Python script when the container starts.

With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:

version: '3.9'

services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data

 

When we run Docker Compose with this file, it will build the Docker image from the current directory using the Dockerfile we just created. It also mounts the local data folder to the /data directory inside the container, which matches the paths used in pipeline.py and makes the dataset accessible to our script.

 

Executing the Pipeline

 
With all the files ready, we will execute the data pipeline in Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline.

docker compose up --build

 

If you run this successfully, you will see an informational log like the following:

 ✔ data-pipeline                           Built                                                                                   0.0s 
 ✔ Network simple_docker_pipeline_default  Created                                                                                 0.4s 
 ✔ Container simple_pipeline_container     Created                                                                                 0.4s 
Attaching to simple_pipeline_container
simple_pipeline_container  | Data Extraction completed.
simple_pipeline_container  | Data Transformation completed.
simple_pipeline_container  | Data Loading completed.
simple_pipeline_container  | Data pipeline completed successfully.
simple_pipeline_container exited with code 0

 

If everything is executed successfully, you will see a new CleanedMedicalData.csv file in your data folder. 
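If you want to verify the output beyond the log messages, you can open the generated file with pandas from the project root. This is an optional check, not part of the pipeline itself:

import pandas as pd

# Inspect the file the pipeline wrote back into the local data folder.
cleaned = pd.read_csv("data/CleanedMedicalData.csv")

print(cleaned.shape)             # row and column counts after dropping missing values
print(cleaned.columns.tolist())  # column names should now be lowercase with underscores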

Congratulations! You have just created a simple data pipeline with Python and Docker. Try using various data sources and ETL processes to see if you can handle a more complex pipeline.

 

Conclusion

 
Understanding data pipelines is crucial for every data professional, as they are essential for acquiring the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to execute it.

I hope this has helped!
 
 

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips through social media and written media. Cornellius writes on a variety of AI and machine learning topics.
