
How to Build a Lightweight Data Pipeline with Airtable and Python

Image by Editor | ChatGPT

 

Introduction

 
Airtable not only offers a flexible, spreadsheet-like interface for data storage and analysis, but also provides an API for programmatic interaction. In other words, you can connect it to external tools such as Python to build data pipelines or processing workflows, bringing your results back into your Airtable database (or simply “base”, in Airtable jargon).

This article demonstrates how to create a simple, ETL-like pipeline using the Airtable Python API. We will stick to the free tier, ensuring the approach works without paid features.

 

Airtable Dataset Setup

 
While the pipeline built in this article can easily be adapted to a variety of datasets, if you are new to Airtable and need an Airtable project and stored dataset as a starting point, we recommend following this recent introductory tutorial to Airtable and creating a tabular dataset called “Customers”, containing 200 rows and the following columns (see image):

 

Customers dataset/table in Airtable | Image by Author

 

Airtable-Python Data Pipeline

 
In Airtable, click your user avatar (at the time of writing, the circled avatar in the bottom-left corner of the app interface) and select “Builder Hub”. On the new screen (see screenshot below), click “Personal access tokens”, then “Create token”. Give the token a name and make sure it includes at least these two scopes: data.records:read and data.records:write. Also, in the “Access” section, select the base containing your customers table so that the token is granted access to it.

 

Creating an Airtable API token | Image by Author

 

Once the token has been created, copy it and store it in a safe place, as it will be shown only once; we will need it later. The token starts with pat followed by a long alphanumeric code.

Another key piece of information we need to build our Python-based pipeline is the ID of our base. Go back to your base in the Airtable web interface; its URL in the browser follows a pattern like https://airtable.com/app[xxxxxx]/xxxx/xxxx. The part to copy is the app[xxxx] segment contained between two consecutive slashes (/): that is the base ID we will need.

With this in hand, and assuming you already have a populated table called “Customers” in your base, we are ready to start our Python program. I will be using a notebook for the code. If you are using an IDE instead, you may prefer to read the three Airtable configuration variables from a .env file; a brief sketch of that alternative follows the variable-definition snippet below. In this version, for simplicity and ease of illustration, we define them directly in the notebook. Let’s start by installing the necessary dependencies:

!pip install pyairtable python-dotenv

 

Next, we define the Airtable configuration variables. For the first two, replace the placeholder values with your actual access token and base ID, respectively:

import os
from dotenv import load_dotenv # Necessary only if reading variables from a .env file
from pyairtable import Api
import pandas as pd

PAT = "pat-xxx" # Your PAT (Personal Access Token) is pasted here
BASE_ID = "app-xxx" # Your Airtable Base ID is pasted here
TABLE_NAME = "Customers"

api = Api(PAT)
table = api.table(BASE_ID, TABLE_NAME)  # connection point to the Customers table
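
As mentioned above, if you prefer not to hard-code credentials in the notebook, here is a minimal sketch of the .env-based alternative. It assumes a .env file in your working directory with two entries, AIRTABLE_PAT and AIRTABLE_BASE_ID (both variable names are just illustrative choices):

# Alternative: read the token and base ID from a .env file instead of hard-coding them.
# The .env file would contain lines such as:
#   AIRTABLE_PAT=pat...
#   AIRTABLE_BASE_ID=app...
import os
from dotenv import load_dotenv
from pyairtable import Api

load_dotenv()  # loads the .env entries into the process environment
PAT = os.getenv("AIRTABLE_PAT")
BASE_ID = os.getenv("AIRTABLE_BASE_ID")
TABLE_NAME = "Customers"

api = Api(PAT)
table = api.table(BASE_ID, TABLE_NAME)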

 

We have just set up an instance of the pyairtable API client and a connection point to the customers table in our base. Next, this is how we read the entire dataset contained in our Airtable table and load it into a Pandas DataFrame. Just be careful to use the exact column names from the source table as the string arguments of the get() calls:

rows = []
for rec in table.all():  # honors 5 rps; auto-retries on 429s
    fields = rec.get("fields", {})
    rows.append({
        "id": rec["id"],
        "CustomerID": fields.get("CustomerID"),
        "Gender": fields.get("Gender"),
        "Age": fields.get("Age"),
        "Annual Income (k$)": fields.get("Annual Income (k$)"),
        "Spending Score (1-100)": fields.get("Spending Score (1-100)"),
        "Income class": fields.get("Income Class"),
    })

df = pd.DataFrame(rows)
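
Optionally, a quick sanity check at this point can confirm that everything loaded as expected, for example:

print(df.shape)         # expect (200, 7) if your Customers table has 200 rows
print(df.isna().sum())  # spot any columns with missing values before transforming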

 

Once the data has been loaded, it is time to apply a transformation. For simplicity, we will apply just one, but we could chain as many as needed, just as we would when preprocessing or cleaning datasets with Pandas. We will create a new binary attribute, called Is High Value, to flag high-value customers, i.e., those whose annual income and spending score are both high:

def high_value(row):
    try:
        return (row["Spending Score (1-100)"] >= 70) and (row["Annual Income (k$)"] >= 70)
    except TypeError:
        return False

df["Is High Value"] = df.apply(high_value, axis=1)
df.head()

 

Resulting dataset:

 

Airtable data transformation with Python and Pandas | Image by Author
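
As an optional check before writing anything back, you can count how many customers end up flagged; only a small fraction should meet both thresholds:

print(df["Is High Value"].value_counts())  # number of high-value vs. other customers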

 

Finally, it is time to write the changes back to Airtable, incorporating the data for the new column. There is a small caveat: we first need to manually create a new column named “High Value” in our Airtable customers table, with its type set to “Checkbox” (Airtable’s equivalent of a binary attribute). Once this blank column has been created, run the following code in your Python program, and the new data will be added to Airtable automatically:

updates = []
for _, r in df.iterrows():
    if pd.isna(r["id"]):
        continue
    updates.append({
        "id": r["id"],
        "fields": {
            "High Value": bool(r["Is High Value"])
        }
    })

if updates:
    table.batch_update(updates)
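
If you want to verify the write-back from Python itself, a small optional check is to re-read only the records whose checkbox is now ticked, using an Airtable formula filter:

# Optional check: fetch only the records where the "High Value" checkbox is ticked
high_value_records = table.all(formula="{High Value}")
print(f"High-value customers in Airtable: {len(high_value_records)}")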

 

Time to go back to Airtable and see what changed in our source customers table! If at first glance you see no changes and the new column still seems empty, don’t panic just yet. Not many customers are labeled as “high value”, and you may need to scroll down a little to see some labeled with a green tick sign:

 

Updated customers table | Image by Author

 

That’s it! You just built your own lightweight, ETL-like data pipeline based on a bidirectional interaction between Airtable and Python. Well done!

 

Wrapping Up

 
This article focused on showcasing data capabilities with Airtable, a versatile and user-friendly cloud-based platform for data management and analysis that combines features of spreadsheets and relational databases with AI-powered functions. In particular, we showed how to run a lightweight data transformation pipeline with the Airtable Python API that reads data from Airtable, transforms it, and loads it back to Airtable — all within the capabilities and limitations of Airtable’s free version.
 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
