
Data is the asset that drives our work as data professionals. Without proper data, we cannot perform our tasks, and our business will fail to gain a competitive advantage. Thus, securing suitable data is crucial for any data professional, and data pipelines are the systems designed for this purpose.
Data pipelines are systems designed to move and transform data from a source to a destination. They are part of the core infrastructure of any business that relies on data, as they help ensure that our data is reliable and always ready to use.
Building a data pipeline may sound complex, but a few simple tools are sufficient to create reliable data pipelines with just a few lines of code. In this article, we will explore how to build a straightforward data pipeline using Python and Docker that you can apply in your everyday data work.
Let’s get into it.
Building the Data Pipeline
Before we build our data pipeline, let’s understand the concept of ETL, which stands for Extract, Transform, and Load. ETL is a process where the data pipeline performs the following actions:
- Extract data from various sources.
- Transform data into a valid format.
- Load data into an accessible storage location.
ETL is a standard pattern for data pipelines, so what we build will follow this structure.
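In code, the pattern boils down to three small functions plus a runner that chains them together. Here is a minimal, generic sketch (the function names are only illustrative; the concrete version for our dataset follows in Step 2):
# Generic ETL skeleton: each stage is a small, testable function.
def extract():
    ...        # read raw data from a file, database, or API

def transform(raw_data):
    ...        # clean, filter, and reshape the raw data
    return raw_data

def load(clean_data):
    ...        # write the result to its destination

def run_pipeline():
    load(transform(extract()))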
With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the data pipeline application’s environment using containers.
Let’s set up our data pipeline with Python and Docker.
Step 1: Preparation
First, we must ensure that we have Python and Docker installed on our system (we will not cover the installation here).
For our example, we will use the heart attack dataset from Kaggle as the data source to develop our ETL process.
With everything in place, we will prepare the project structure. Overall, the simple data pipeline will have the following skeleton:
simple-data-pipeline/
├── app/
│   └── pipeline.py
├── data/
│   └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml
There is a main folder called simple-data-pipeline, which contains:
- An app folder containing the pipeline.py file.
- A data folder containing the source data (Medicaldataset.csv).
- The requirements.txt file for environment dependencies.
- The Dockerfile for the Docker configuration.
- The docker-compose.yml file to define and run our multi-container Docker application.
We will first fill out the requirements.txt file, which contains the libraries required for our project. In this case, we will only use the following library:
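pandas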
In the next section, we will set up the data pipeline using our sample data.
Step 2: Set up the Pipeline
We will set up the Python pipeline.py file for the ETL process. In our case, we will use the following code:
import pandas as pd
import os

input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")

def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()
The pipeline follows the ETL process: we read the source CSV file (extract), drop missing rows and clean the column names (transform), and write the cleaned data to a new CSV file (load). We wrapped these steps into a single run_pipeline function that executes the entire process.
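If you want to sanity-check the transformation logic before containerizing it, you can call transform_data on a small hand-made DataFrame. A quick local test sketch (it assumes pandas is installed on your host and that you run it from inside the app folder so pipeline.py is importable):
import pandas as pd
from pipeline import transform_data

# A tiny frame with a messy column name and one missing value.
sample = pd.DataFrame({"Heart Rate": [72, None], "Age": [45, 60]})
cleaned = transform_data(sample)

print(cleaned.columns.tolist())  # ['heart_rate', 'age']
print(len(cleaned))              # 1, because the row with the missing value is dropped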
Step 3: Set up the Dockerfile
With the Python pipeline file ready, we will fill in the Dockerfile to set up the configuration for the Docker container using the following code:
FROM python:3.10-slim
WORKDIR /app
COPY ./app /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
CMD ["python", "pipeline.py"]
In the code above, we specify that the container will use Python version 3.10 as its environment. Next, we set the container’s working directory to /app and copy everything from our local app folder into the container’s /app directory. We also copy the requirements.txt file and execute the pip installation within the container. Finally, we specify the command to run the Python script when the container starts.
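If you want to test the image on its own before adding Compose, you can also build and run it with plain Docker commands. A quick manual alternative (the image tag simple-data-pipeline is just an example, and $(pwd) assumes a Unix-like shell):
docker build -t simple-data-pipeline .
docker run --rm -v "$(pwd)/data:/data" simple-data-pipeline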
With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:
version: '3.9'
services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data
When executed, the YAML file above builds the Docker image from the current directory using the available Dockerfile. We also mount the local data folder to the /data folder inside the container, making the dataset accessible to our script.
Executing the Pipeline
With all the files ready, we will execute the data pipeline in Docker. Go to the project root folder and run the following command in your command prompt to build the Docker image and execute the pipeline.
docker compose up --build
If you run this successfully, you will see an informational log like the following:
✔ data-pipeline Built 0.0s
✔ Network simple_docker_pipeline_default Created 0.4s
✔ Container simple_pipeline_container Created 0.4s
Attaching to simple_pipeline_container
simple_pipeline_container | Data Extraction completed.
simple_pipeline_container | Data Transformation completed.
simple_pipeline_container | Data Loading completed.
simple_pipeline_container | Data pipeline completed successfully.
simple_pipeline_container exited with code 0
If everything is executed successfully, you will see a new CleanedMedicalData.csv file in your data folder.
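To double-check the result from your host machine, you can load the cleaned file with pandas. A quick inspection sketch (it assumes pandas is installed locally and that you run it from the project root):
import pandas as pd

# Read the pipeline output from the mounted data folder.
df = pd.read_csv("data/CleanedMedicalData.csv")

print(df.shape)               # rows and columns after cleaning
print(df.columns.tolist())    # lowercase, underscore-separated column names
print(df.isna().sum().sum())  # should be 0, since the pipeline drops missing rows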
Congratulations! You have just created a simple data pipeline with Python and Docker. Try using various data sources and ETL processes to see if you can handle a more complex pipeline.
Conclusion
Understanding data pipelines is crucial for every data professional, as they are essential for acquiring the right data for their work. In this article, we explored how to build a simple data pipeline using Python and Docker and learned how to execute it.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.