
Processing Millions of Events from Thousands of Aircraft with One Declarative Pipeline

Every second, tens of thousands of aircraft generate IoT events across the globe—from a small Cessna carrying four tourists over the Grand Canyon to an Airbus A380 departing Frankfurt with 570 passengers, broadcasting location, altitude, and flight path on its transatlantic route to New York.

Like air traffic controllers who must continuously update complex flight paths as weather and traffic conditions evolve, data engineers need platforms that can handle high-throughput, low-latency avionics data streams. Pausing processing is not an option for either of these mission-critical systems.

Building such data pipelines used to mean wrestling with hundreds of lines of code, managing compute clusters, and configuring complex permissions just to get ETL working. Those days are over. With Lakeflow Declarative Pipelines, you can build production-ready streaming pipelines in minutes using plain SQL (or Python, if you prefer), running on serverless compute with unified governance and fine-grained access control.

This article walks you through an architecture that applies to transportation, logistics, and freight use cases. It demonstrates a pipeline that ingests real-time avionics data from every aircraft currently flying over North America, processing live flight-status updates with just a few lines of declarative code.

Real-World Streaming at Scale

Most streaming tutorials promise real-world examples but deliver synthetic datasets that overlook production-scale volume, velocity, and variety. The aviation industry processes some of the world’s most demanding real-time data streams: aircraft positions update multiple times per second, with low-latency requirements for safety-critical applications.

The OpenSky Network, a crowd-sourced project from researchers at the University of Oxford and other research institutes, provides free access to live avionics data for non-commercial use. This allows us to demonstrate enterprise-grade streaming architectures with genuinely compelling data.

While tracking flights on your phone is casual fun, the same data stream powers billion-dollar logistics operations: port authorities coordinate ground operations, delivery services integrate flight schedules into notifications, and freight forwarders track cargo movements across global supply chains.

Architectural Innovation: Custom Data Sources as First-Class Citizens

Traditional architectures require significant coding and infrastructure overhead to connect external systems to your data platform. To ingest third-party data streams, you typically need to pay for third-party SaaS solutions or develop custom connectors with authentication management, flow control, and complex error handling.

In the Data Intelligence Platform, Lakeflow Connect addresses this complexity for enterprise business systems like Salesforce, Workday, and ServiceNow by providing an ever-growing number of managed connectors that automatically handle authentication, change data capture, and error recovery.

The OSS foundation of Lakeflow, Apache Spark™, comes with an extensive ecosystem of built-in data sources that can read from dozens of technical systems: from cloud storage formats like Parquet, Iceberg, or Delta Lake to message buses like Apache Kafka, Pulsar, or Amazon Kinesis. For example, you can easily connect to a Kafka topic using spark.readStream.format("kafka"), and this familiar syntax works consistently across all supported data sources.
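As a minimal sketch, such a streaming read from Kafka looks like the following; the broker address and topic name are placeholders for your own environment:

```python
# A minimal sketch of a streaming read from Kafka; the broker address and
# topic name are placeholders for your own environment.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "flight-events")
      .load())
```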

However, there is a gap between the enterprise systems that Lakeflow Connect covers and Spark’s technology-based connectors: third-party systems accessed through arbitrary APIs. Many services provide REST APIs that fit neither category, yet organizations need this data in their lakehouse.

PySpark custom data sources fill this gap with a clean abstraction layer that makes API integration as simple as any other data source.

For this blog, I implemented a PySpark custom data source for the OpenSky Network and made it available as a simple pip install. The data source encapsulates API calls, authentication, and error handling. You simply replace “kafka” with “opensky” in the example above, and the rest works identically:
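The sketch below assumes the source registers under the short name "opensky"; the import path and class name are assumptions and should be verified against the pyspark-data-sources documentation:

```python
# Sketch: reading the live OpenSky feed through the custom PySpark data source.
# Depending on the package version, the source class may need to be registered
# first; the import path and class name below are assumptions, so check the
# pyspark-data-sources documentation.
from pyspark_datasources import OpenSkyDataSource  # assumed class name

spark.dataSource.register(OpenSkyDataSource)

df = spark.readStream.format("opensky").load()
```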

Using this abstraction, teams can focus on business logic rather than integration overhead, while maintaining the same developer experience across all data sources.

The custom data source pattern is a generic architectural solution that works seamlessly for any external API—financial market data, IoT sensor networks, social media streams, or predictive maintenance systems. Developers can leverage the familiar Spark DataFrame API without worrying about HTTP connection pooling, rate limiting, or authentication tokens.
 
This approach is particularly valuable for third-party systems where no enterprise-grade managed connector exists, but the integration effort justifies building a reusable one.

Streaming Tables: Exactly-Once Ingestion Made Simple

Now that we’ve established how custom data sources handle API connectivity, let’s examine how streaming tables process this data reliably. IoT data streams present specific challenges around duplicate detection, late-arriving events, and processing guarantees. Traditional streaming frameworks require careful coordination between multiple components to achieve exactly-once semantics.

Streaming tables in Lakeflow Declarative Pipelines eliminate this complexity through declarative semantics, and Lakeflow excels at both low-latency and high-throughput workloads.

This may be one of the first articles to showcase streaming tables powered by custom data sources, but it won’t be the last. With declarative pipelines and PySpark data sources now open source and broadly available in Apache Spark™, these capabilities are becoming accessible to developers everywhere.
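Below is a sketch of such a pipeline definition in Python, using the dlt module and assuming the custom "opensky" source from the previous section is installed and registered; the table name matches the ingest_flights table queried later in this post:

```python
import dlt  # the Python module for defining declarative pipelines

@dlt.table(comment="Live avionics events ingested from the OpenSky Network")
def ingest_flights():
    # The custom data source hides API calls, authentication, and error
    # handling; the pipeline only declares what to ingest.
    return spark.readStream.format("opensky").load()
```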

The code above accesses the avionics data as a data stream. The same code works identically for streaming and batch processing. With Lakeflow, you can configure the pipeline’s execution mode and trigger the execution using a workflow such as Lakeflow Jobs.

This brief implementation demonstrates the power of declarative programming. The code above results in a streaming table of continuously ingested live avionics data; it is the complete implementation, streaming data from roughly 10,000 planes currently flying over the U.S. (depending on the time of day). The platform handles everything else: authentication, incremental processing, error recovery, and scaling.
 
Every detail, such as each plane’s call sign, current location, altitude, speed, direction, and destination, is ingested into the streaming table. The example is not a toy snippet but a complete implementation that delivers real, actionable data at scale.

 

The full application can easily be written interactively, from scratch, with the new Lakeflow Declarative Pipelines Editor. The new editor uses files by default, so you can add the data source package pyspark-data-sources directly in the editor under Settings/Environments instead of running pip install in a notebook.

Behind the scenes, Lakeflow manages the streaming infrastructure: automatic checkpointing ensures failure recovery, incremental processing eliminates redundant computation, and exactly-once guarantees prevent data duplication. Data engineers write business logic; the platform ensures operational excellence.

Optional Configuration

The example above works independently and is fully functional out of the box. However, production deployments typically require additional configuration. In real-world scenarios, users may need to specify the geographic region for OpenSky data collection, enable authentication to increase API rate limits, and enforce data quality constraints to prevent bad data from entering the system.

Geographic Regions

You can track flights over specific regions by specifying predefined bounding boxes for major continents and geographic areas. The data source includes regional filters such as AFRICA, EUROPE, and NORTH_AMERICA, among others, plus a global option for worldwide coverage. These built-in regions help you control the volume of data returned while focusing your analysis on geographically relevant areas for your specific use case.
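For example, restricting ingestion to flights over North America might look like the following; the option name and region constant follow the connector’s regional filters and should be verified against the pyspark-data-sources documentation:

```python
# Sketch: limit the OpenSky stream to one predefined region. The "region"
# option name and the NORTH_AMERICA constant are assumptions based on the
# connector's built-in regional filters.
df = (spark.readStream
      .format("opensky")
      .option("region", "NORTH_AMERICA")
      .load())
```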

Rate Limiting and OpenSky Network Authentication

Authentication with the OpenSky Network provides significant benefits for production deployments. Authenticating raises the API rate limit from 100 calls per day (anonymous) to 4,000 calls per day, which is essential for real-time flight tracking applications.

To authenticate, register for API credentials at https://opensky-network.org and provide your client_id and client_secret as options when configuring the data source. These credentials should be stored as Databricks secrets rather than hardcoded in your code for security.
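A sketch of passing those credentials from a Databricks secret scope rather than hardcoding them; the scope and key names are placeholders, and the option names are assumptions to check against the connector’s documentation:

```python
# Read OpenSky client credentials from a Databricks secret scope (scope and
# key names are placeholders) so they never appear in the code itself.
client_id = dbutils.secrets.get(scope="opensky", key="client_id")
client_secret = dbutils.secrets.get(scope="opensky", key="client_secret")

df = (spark.readStream
      .format("opensky")
      .option("client_id", client_id)
      .option("client_secret", client_secret)
      .load())
```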

Note that you can raise this limit to 8,000 calls per day if you feed your own data to the OpenSky Network, a fun side project that involves putting an ADS-B antenna on your balcony to contribute to the crowd-sourced initiative.

Data Quality with Expectations

Data quality is critical for reliable analytics. Declarative Pipeline expectations define rules to automatically validate streaming data, ensuring only clean records reach your tables.

These expectations can catch missing values, invalid formats, or business rule violations. You can drop bad records, quarantine them for review, or halt the pipeline when validation fails. The code in the next section demonstrates how to configure region selection, authentication, and data quality validation for production use.

Revised Streaming Table Example

The implementation below shows an example of the streaming table with region parameters and authentication, demonstrating how the data source handles geographic filtering and API credentials. Data quality validation checks whether the aircraft ID (managed by the International Civil Aviation Organization – ICAO) and the plane’s coordinates are set.
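A sketch of that revised definition, assuming the dlt Python API; the region constant, option names, secret scope, and column names are assumptions to verify against the connector’s documentation:

```python
import dlt  # declarative pipeline definitions in Python

@dlt.table(comment="Flights over North America with basic data quality checks")
@dlt.expect_or_drop("valid_icao24", "icao24 IS NOT NULL")
@dlt.expect_or_drop("valid_position", "latitude IS NOT NULL AND longitude IS NOT NULL")
def ingest_flights():
    # Secret scope/key names are placeholders; the region and credential
    # option names are assumptions based on the OpenSky connector.
    return (
        spark.readStream.format("opensky")
        .option("region", "NORTH_AMERICA")
        .option("client_id", dbutils.secrets.get(scope="opensky", key="client_id"))
        .option("client_secret", dbutils.secrets.get(scope="opensky", key="client_secret"))
        .load()
    )
```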

Materialized Views: Precomputed Results for Analytics

Real-time analytics on streaming data traditionally requires complex architectures combining stream processing engines, caching layers, and analytical databases. Each component introduces operational overhead, consistency challenges, and additional failure modes.

Materialized views in Lakeflow Declarative Pipelines reduce this architectural overhead by abstracting the underlying runtime with serverless compute. A simple SQL statement creates a materialized view containing precomputed results that update automatically as new data arrives. These results are optimized for downstream consumption by dashboards, Databricks Apps, or additional analytics tasks in a workflow implemented with Lakeflow Jobs.
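As a sketch, such a view could aggregate the ingest_flights streaming table like this; the view name and the velocity and altitude column names are assumptions based on OpenSky’s state vector fields:

```sql
-- Sketch: global flight statistics, refreshed incrementally as new events arrive.
CREATE OR REFRESH MATERIALIZED VIEW flight_statistics AS
SELECT
  COUNT(DISTINCT icao24) AS unique_aircraft,
  AVG(velocity)          AS avg_speed_m_per_s,
  AVG(baro_altitude)     AS avg_altitude_m,
  MAX(baro_altitude)     AS max_altitude_m
FROM ingest_flights
WHERE baro_altitude IS NOT NULL;
```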

This materialized view aggregates aircraft status updates from the streaming table, generating global statistics on flight patterns, speeds, and altitudes. As new IoT events arrive, the view updates incrementally on the serverless Lakeflow platform. Processing only a few thousand changes, rather than recomputing nearly a billion events each day, dramatically reduces processing time and cost.

The declarative approach in Lakeflow Declarative Pipelines removes traditional complexity around change data capture, incremental computation, and result caching. This allows data engineers to focus solely on analytical logic when creating views for dashboards, Databricks applications, or any other downstream use case.

AI/BI Genie: Natural Language for Real-Time Insights

More data often creates new organizational challenges. Despite real-time data availability, pipelines are usually modified only by technical data engineering teams, so business teams depend on engineering resources for ad hoc analysis.

AI/BI Genie enables natural language queries against streaming data for everyone. Non-technical users can ask questions in plain English, and queries are automatically translated to SQL against real-time data sources. The transparency of being able to verify the generated SQL provides crucial safeguards against AI hallucination while also maintaining query performance and governance standards.

Behind the scenes, Genie uses agentic reasoning to understand your questions while following Unity Catalog access rules. It asks for clarification when uncertain and learns your business terms through example queries and instructions.

For example, “How many unique flights are currently tracked?” is internally translated to SELECT COUNT(DISTINCT icao24) FROM ingest_flights. The magic is that you don’t need to know any column names in your natural language request.

Another command, “Plot altitude vs. velocity for all aircraft,” generates a visualization showing the correlation of speed and altitude. And “plot the locations of all planes on a map” illustrates the spatial distribution of the avionics events, with altitude represented through color coding.

This capability is compelling for real-time analytics, where business questions often emerge rapidly as conditions change. Instead of waiting for engineering resources to write custom queries with complex temporal window aggregations, domain experts explore streaming data directly, discovering insights that drive immediate operational decisions.

Visualize Data in Real Time

Once your data is available as Delta or Iceberg tables, you can use virtually any visualization tool or graphics library. For example, the visualization shown here was created using Dash, running as a Lakehouse Application with a timelapse effect.
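As a rough sketch (not the original app, which adds a timelapse effect), a minimal Dash version could query the ingest_flights table through a SQL warehouse and plot positions on a map; the hostnames, token, and column names below are placeholders:

```python
# Minimal Dash sketch: plot current aircraft positions from the lakehouse table.
# Connection details come from environment variables; column names are assumptions.
import os

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html
from databricks import sql  # pip install databricks-sql-connector


def load_positions() -> pd.DataFrame:
    """Fetch the latest aircraft positions via a Databricks SQL warehouse."""
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT icao24, latitude, longitude, baro_altitude "
                "FROM ingest_flights WHERE latitude IS NOT NULL"
            )
            rows = cur.fetchall()
            columns = [c[0] for c in cur.description]
    return pd.DataFrame(rows, columns=columns)


df = load_positions()
app = Dash(__name__)
app.layout = html.Div([
    html.H3("Live Aircraft Positions"),
    dcc.Graph(figure=px.scatter_mapbox(
        df, lat="latitude", lon="longitude", color="baro_altitude",
        hover_name="icao24", zoom=3, mapbox_style="open-street-map",
    )),
])

if __name__ == "__main__":
    app.run(debug=True)
```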

This approach demonstrates how modern data platforms not only simplify data engineering but also empower teams to deliver impactful insights visually in real time.

7 Lessons Learned about the Future of Data Engineering

Implementing this real-time avionics pipeline taught me fundamental lessons about modern streaming data architecture.

These seven insights apply universally: streaming analytics becomes a competitive advantage when accessible through natural language, when data engineers focus on business logic instead of infrastructure, and when AI-powered insights drive immediate operational decisions.

1. Custom PySpark Data Sources Bridge the Gap
PySpark custom data sources fill the gap between Lakeflow’s managed connectors and Spark’s technology-based connectivity. They encapsulate API complexity into reusable components that feel native to Spark developers. While implementing such connectors isn’t trivial, Databricks Assistant and other AI helpers provide valuable guidance during development.
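To give a feel for what such a connector involves, here is a heavily trimmed skeleton of the PySpark Python data source API; the class names, schema, and offset logic are illustrative and not the actual OpenSky implementation:

```python
# Skeleton of a custom PySpark streaming data source (illustrative only, not the
# actual OpenSky connector). Requires PySpark 4.0+ or a Databricks Runtime with
# Python data source support.
from typing import Iterator, Tuple

from pyspark.sql.datasource import DataSource, SimpleDataSourceStreamReader


class FlightApiStreamReader(SimpleDataSourceStreamReader):
    """Polls a hypothetical REST API and returns new rows per microbatch."""

    def initialOffset(self) -> dict:
        # Start from an initial marker; a real connector would track API timestamps.
        return {"last_poll": 0}

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
        # A real implementation would call the API here and handle auth,
        # retries, and rate limits, then return rows plus the next offset.
        rows = iter([("abc123", 48.35, 11.78, 10972.8)])
        return rows, {"last_poll": start["last_poll"] + 1}


class FlightApiDataSource(DataSource):
    """Exposes the reader under a short name usable in .format(...)."""

    @classmethod
    def name(cls) -> str:
        return "flightapi"

    def schema(self) -> str:
        return "icao24 string, latitude double, longitude double, baro_altitude double"

    def simpleStreamReader(self, schema) -> FlightApiStreamReader:
        return FlightApiStreamReader()
```

Once the class is registered with spark.dataSource.register(FlightApiDataSource), it can be used like any built-in source via spark.readStream.format("flightapi").load().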

PySpark custom data sources are still rarely written about or used, but they open up many possibilities, from better benchmarking to improved testing to more comprehensive tutorials and exciting conference talks.

2. Declarative Accelerates Development
Using the new Declarative Pipelines with a PySpark data source, I achieved remarkable simplicity: what looks like a code snippet is the complete implementation. Writing fewer lines of code isn’t just about developer productivity; it’s about operational reliability. Declarative pipelines eliminate entire classes of bugs around state management, checkpointing, and error recovery that plague imperative streaming code.

3. The Lakehouse Architecture Simplifies
The Lakehouse brought everything together—data lakes, warehouses, and all the tools—in one place.

During development, I could quickly switch between building ingestion pipelines, running analytics in DBSQL, and visualizing results with AI/BI Genie or Databricks Apps, all on the same tables. My workflow became seamless with Databricks Assistant, which is available everywhere, and the ability to deploy real-time visualizations directly on the platform.

What began as a data platform became my complete development environment, with no more context switching or tool juggling.

4. Visualization Flexibility is Key
Lakehouse data is accessible to a wide range of visualization tools and approaches—from classic notebooks for quick exploration, to AI/BI Genie for instant dashboards, to custom web apps for rich, interactive experiences. For a real-world example, see how I used Dash as a Lakehouse Application earlier in this post.

5. Streaming Data Becomes Conversational
For years, accessing real-time insights required deep technical expertise, complex query languages, and specialized tools that created barriers between data and decision-makers.

Now you can ask questions with Genie directly against live data streams. Genie transforms streaming data analytics from a technical challenge into a simple conversation.

6. AI Tooling Support is a Multiplier
Having AI assistance integrated throughout the lakehouse fundamentally changed how quickly I could work. What impressed me most was how Genie learned from the platform context.

AI-supported tooling amplifies your skills. Its true power is unlocked when you have a strong technical foundation to build on.

 

7. Infrastructure and Governance Abstractions Create Business Focus
When the platform handles operational complexity automatically—from scaling to error recovery—teams can concentrate on extracting business value rather than fighting technology constraints. This shift from infrastructure management to business logic represents the future of streaming data engineering.

TL;DR: The future of streaming data engineering is AI-supported, declarative, and laser-focused on business outcomes. Organizations that embrace this architectural shift will find themselves asking better questions of their data and building more solutions faster.

Do you want to learn more?

Get Hands-on!

The complete flight tracking pipeline can be run on the Databricks Free Edition, making Lakeflow accessible to anyone with just a few simple steps outlined in our GitHub repository.
