Introduction
In the recent trade war, governments have weaponized commerce through cycles of retaliatory tariffs, quotas, and export bans. The shockwaves have rippled across supply chain networks and forced companies to reroute sourcing, reshore production, and stockpile critical inputs—measures that extend lead times and erode once-lean, just-in-time operations. Each detour carries a cost: rising input prices, increased logistics expenses, and excess inventory tying up working capital. As a result, profit margins shrink, cash-flow volatility increases, and balance-sheet risks intensify.
Was the trade war a singular event that caught global supply chains off guard? Perhaps in its specifics, but the magnitude of disruption was hardly unprecedented. Within the span of just a few years, the COVID-19 pandemic, the 2021 Suez Canal blockage, and the ongoing Russo-Ukrainian war each delivered a major shock, arriving roughly a year apart. These events, difficult to foresee, have caused substantial disruption to global supply chains.
What can be done to prepare for such disruptive events? Instead of reacting in panic to last-minute changes, can companies make informed decisions and take proactive steps before a crisis unfolds? A widely cited paper by MIT professor David Simchi-Levi offers a compelling, data-driven approach to this challenge. At the core of his method is the creation of a digital twin—a graph-based model where nodes represent sites and facilities in the supply chain, and edges represent the flow of materials between them. A wide range of disruption scenarios is then applied to the network, and its responses are measured. Through this process, companies can assess potential impacts, uncover hidden vulnerabilities, and identify redundant investments.
This process, known as stress testing, has been widely adopted across industries. Ford Motor Company, for example, applied this approach across its operations and supply network, which includes over 4,400 direct supplier sites, hundreds of thousands of lower-tier suppliers, more than 50 Ford-owned facilities, 130,000 unique parts, and over $80 billion in annual external procurement. Their analysis revealed that approximately 61% of supplier sites, if disrupted, would have no impact on profits—while about 2% would have a significant impact. These insights fundamentally reshaped their approach to supply chain risk management.
The remainder of this blog post provides a high-level overview of how to implement such a solution and perform a comprehensive analysis on Databricks. The supporting notebooks are open-sourced and available here.
Stress Testing Supply Chain Networks on Databricks
Imagine a scenario where we work for a global retailer or a consumer goods company and are tasked with enhancing supply chain resiliency. Specifically, this means ensuring that our supply chain network can meet customer demand, to the fullest extent possible, during future disruptive events. To achieve this, we must identify vulnerable sites and facilities within the network that could cause disproportionate damage if they fail, and reassess our investments to mitigate the associated risks. Identifying high-risk locations also helps us recognize low-risk ones. If we uncover areas where we are overinvesting, we can either reallocate those resources to balance risk exposure or reduce unnecessary costs.
The first step toward achieving our goal is to construct a digital twin of our supply chain network. In this model, supplier sites, production facilities, warehouses, and distribution centers are represented as nodes in a graph, while the edges between them capture the flow of materials throughout the network. Creating this model requires operational data such as inventory levels, production capacities, bills of materials, and product demand. By using these data as inputs to a linear optimization program—designed to optimize a key metric such as profit or cost—we can determine the optimal configuration of the network for that given objective. This enables us to identify how much material should be sourced from each sub-supplier, where it should be transported, and how it should flow through the network to production sites to optimize the selected metric—a supply chain optimization approach widely adopted by many organizations. Stress testing goes a step further—introducing the concepts of time-to-recover (TTR) and time-to-survive (TTS).
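To make this concrete, below is a minimal sketch of such a baseline optimization on a toy network with two tier-one suppliers feeding a single finished good, built with the open-source PuLP library and its bundled CBC solver. The data, the simplified constraints, and the `solve_network` helper are illustrative assumptions, not the actual model shipped in the solution accelerator.

```python
# A minimal, illustrative baseline optimization: two supplier sites ("S1", "S2")
# feed a single finished good; we maximize per-period profit subject to
# supplier capacity and demand. All numbers and names are assumptions.
import pulp

def solve_network(capacity: dict, demand: float = 150.0, margin: float = 20.0) -> float:
    """Return the optimal per-period profit for the given supplier capacities."""
    model = pulp.LpProblem("baseline_network", pulp.LpMaximize)

    # Decision variables: material shipped from each supplier, and units sold
    flow = pulp.LpVariable.dicts("flow", list(capacity), lowBound=0)
    sold = pulp.LpVariable("sold", lowBound=0, upBound=demand)

    model += margin * sold                         # objective: maximize profit
    for site, cap in capacity.items():
        model += flow[site] <= cap                 # supplier capacity limits
    model += sold <= pulp.lpSum(flow.values())     # can only sell what is supplied

    model.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(model.objective)

# Baseline: both suppliers operating normally -> 150 units x $20 margin = $3,000
print(solve_network({"S1": 100, "S2": 80}))
```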
Time-to-recover (TTR)
TTR is one of the key inputs to the network. It indicates how long a node—or a group of nodes—takes to recover to its normal state after a disruption. For example, if one of your supplier’s production sites experiences a fire and becomes non-operational, TTR represents the time required for that site to resume supplying at its previous capacity. TTR is typically obtained directly from suppliers or through internal assessments.
With TTR in hand, we begin simulating disruptive scenarios. Under the hood, this involves removing or limiting the capacity of a node—or a set of nodes—affected by the disruption and allowing the network to re-optimize its configuration to maximize profit or minimize cost across all products under the given constraints. We then assess the financial loss of operating under this new configuration and calculate the cumulative impact over the duration of the TTR. This gives us the estimated impact of the specific disruption. We repeat this process for thousands of scenarios in parallel using Databricks’ distributed computing capabilities.
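Reusing the illustrative `solve_network` helper from the sketch above, evaluating a single scenario amounts to zeroing out the disrupted supplier's capacity, letting the LP re-optimize, and multiplying the per-period profit shortfall by the TTR. A Spark sketch of fanning this out across many scenarios appears later in this post.

```python
# Illustrative evaluation of one disruption scenario, reusing the
# solve_network() helper sketched earlier.
def simulate_disruption(supplier_id: str, ttr: float, capacity: dict) -> float:
    """Estimated lost profit from losing `supplier_id` for `ttr` periods."""
    baseline = solve_network(capacity)
    disrupted = solve_network({**capacity, supplier_id: 0})  # knock out the node
    return (baseline - disrupted) * ttr                      # cumulative impact over the TTR

# Example: supplier S1 is down for 4 periods; only 80 units/period remain,
# so 70 lost units x $20 margin x 4 periods = $5,600 in lost profit
print(simulate_disruption("S1", ttr=4.0, capacity={"S1": 100, "S2": 80}))
```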
Below is an example of an analysis performed on a multi-tier network producing 200 finished goods, with materials sourced from 500 tier-one suppliers and 1000 tier-two suppliers. Operational data were randomly generated within reasonable constraints. For the disruptive scenarios, each supplier node was removed individually from the graph and assigned a random TTR. The scatter plot below displays total spend on supplier sites for risk mitigation on the vertical axis and lost profit on the horizontal axis. This visualization allows us to quickly identify areas where risk mitigation investment is undersized relative to the potential damage of a node failure (red box), as well as areas where investment is oversized compared to the risk (green box). Both regions present opportunities to revisit and optimize our investment strategy—either to enhance network resiliency or to reduce unnecessary costs.

Time-to-survive (TTS)
TTS offers another perspective on the risk associated with node failure. Unlike TTR, TTS is not an input but an output—a decision variable. When a disruption occurs and impacts a node or a group of nodes, TTS indicates how long the reconfigured network can continue fulfilling customer demand without any loss. The risk becomes more pronounced when TTR is significantly longer than TTS.
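As a rough, simplified illustration of the idea (in the full analysis, TTS falls out of the optimization model itself), the sketch below steps forward period by period and checks whether the surviving capacity plus an inventory buffer can still cover full demand. The names and numbers are assumptions.

```python
# Illustrative approximation of TTS: count how many periods full demand can
# still be met after a supplier fails, drawing down a finished-goods buffer.
def time_to_survive(demand: float, surviving_capacity: float,
                    inventory: float, horizon: int = 52) -> int:
    periods = 0
    for _ in range(horizon):
        shortfall = max(demand - surviving_capacity, 0.0)
        if shortfall > inventory:    # demand can no longer be fully met
            break
        inventory -= shortfall       # cover the gap from the buffer
        periods += 1
    return periods

# Example: demand of 150/period, 80/period still available, 280 units in stock
print(time_to_survive(demand=150, surviving_capacity=80, inventory=280))  # -> 4 periods
```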
Below is another analysis conducted on the same network. The histogram shows the distribution of differences between TTR and TTS for each node. Nodes with a negative TTR − TTS are generally not a concern—assuming the provided TTR values are accurate. However, nodes with a positive TTR − TTS may incur financial loss, especially those with a large gap. To enhance network resiliency, we should consider reducing TTR by renegotiating terms with suppliers, increasing TTS by building inventory buffers, or diversifying our sourcing strategy.

By combining TTR and TTS analysis, we can gain a deeper understanding of supply chain network resiliency. This exercise can be conducted strategically on a yearly or quarterly basis to inform sourcing decisions, or more tactically on a weekly or daily basis to monitor fluctuating risk levels across the network—helping to ensure smooth and responsive supply chain operations.
On the network described above (1,700 nodes), the TTR and TTS analyses completed in 5 and 40 minutes, respectively, on a lightweight four-node cluster—all for under $10 in cloud spend. This highlights the solution's speed and cost-effectiveness. However, as supply chain complexity and business requirements grow—with increased variability, interdependencies, and edge cases—the solution may require greater computational power and more simulations to maintain confidence in the results.
Why Databricks
Every data-driven solution relies on the quality and completeness of the input dataset—and stress testing is no exception. Companies need high-quality operational data from their suppliers and sub-suppliers, including information on bills of materials, inventory, production capacities, demand, TTR, and more. Collecting and curating this data is not trivial. Moreover, building a transparent and flexible stress-testing framework that reflects the unique aspects of your business requires access to a wide range of open-source and third-party tools—and the ability to select the right combination. In particular, this includes LP solvers and modeling frameworks. Finally, the effectiveness of stress testing hinges on the breadth of the disruption scenarios considered. Running such a comprehensive set of simulations demands access to highly scalable computing resources.
Databricks is the ideal platform for building this type of solution. While there are many reasons, the most important include:
- Delta Sharing: Access to up-to-date operational data is essential for developing a resilient supply chain solution. Delta Sharing is a powerful capability that enables seamless data exchange between companies and their suppliers—even when one party is not using the Databricks platform. Once the data is available in Databricks, business analysts, data engineers, data scientists, statisticians, and managers can all collaborate on the solution within a unified data intelligence platform.
- Open Standards: Databricks integrates seamlessly with a broad range of open-source and third-party technologies, enabling teams to leverage familiar tools and libraries with minimal friction. Users have the flexibility to define and model their own business problems, tailoring solutions to specific operational needs. Open-source tools provide full transparency into their internals—crucial for auditability, validation, and continuous improvement—while proprietary tools may offer performance advantages. On Databricks, you have the freedom to choose the tools that best suit your needs.
- Scalability: Solving optimization problems on networks with thousands of nodes is computationally intensive. Stress testing requires running simulations across tens of thousands of disruption scenarios—whether for strategic (yearly/quarterly) or tactical (weekly/daily) planning—which demands a highly scalable platform. Databricks excels in this area, offering horizontal scaling to efficiently handle complex workloads, powered by strong integration with distributed computing frameworks such as Ray and Spark; a minimal sketch of this scenario fan-out follows below.
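As a rough sketch of how that scenario fan-out could look, the snippet below maps the illustrative `simulate_disruption` helper from earlier over a Spark DataFrame of scenarios with a pandas UDF. The scenario table, column names, and capacity data are placeholders.

```python
# Illustrative fan-out of disruption scenarios across a cluster with PySpark,
# assuming the simulate_disruption() helper sketched earlier is available on
# the workers. Table and column names are placeholders.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# One row per disruption scenario: which supplier fails and its assumed TTR
scenarios = spark.createDataFrame(
    [("S1", 4.0), ("S2", 2.0)],
    ["supplier_id", "ttr"],
)

@pandas_udf("double")
def lost_profit(supplier_id: pd.Series, ttr: pd.Series) -> pd.Series:
    capacity = {"S1": 100, "S2": 80}  # in practice, loaded from operational data
    return pd.Series(
        [simulate_disruption(s, t, capacity) for s, t in zip(supplier_id, ttr)]
    )

scenarios.withColumn("lost_profit", lost_profit("supplier_id", "ttr")).show()
```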
Summary
Global supply chains often lack visibility into network vulnerabilities and struggle to predict which supplier sites or facilities would cause the most damage during disruptions—leading to reactive crisis management. In this article, we presented an approach to build a digital twin of the supply chain network by leveraging operational data and running stress testing simulations that evaluate Time-to-Recover (TTR) and Time-to-Survive (TTS) metrics across thousands of disruption scenarios on Databricks’ scalable platform. This method enables companies to optimize risk mitigation investments by identifying high-impact, vulnerable nodes—similar to Ford’s discovery that only a small fraction of supplier sites significantly affect profits—while avoiding overinvestment in low-risk areas. The result is preserved profit margins and reduced supply chain costs.
Databricks is ideally suited for this approach, thanks to its scalable architecture, Delta Sharing for real-time data exchange, and seamless integration with open-source and third-party tools for transparent, flexible, efficient, and cost-effective supply chain modeling. Download the notebooks to explore how stress testing of supply chain networks at scale can be implemented on Databricks.