
Re-Engineering Ethernet for AI Fabric

[SPONSORED GUEST ARTICLE]   For years, InfiniBand has been the go-to networking technology for high-performance computing (HPC) and AI workloads due to its low latency and lossless transport. But as AI clusters grow to thousands of GPUs and demand open, scalable infrastructure, the industry is shifting.

Leading AI infrastructure providers are increasingly moving from proprietary InfiniBand to Ethernet – driven by cost, simplicity, and ecosystem flexibility. However, traditional Ethernet lacks one critical capability: deterministic, lossless performance for AI workloads.

Why Traditional Ethernet Falls Short

Ethernet wasn’t built with AI in mind. While cost-effective and ubiquitous, its best-effort, packet-based nature creates major challenges in AI clusters:

  • Latency Sensitivity: Distributed AI training is highly sensitive to jitter and latency. Standard Ethernet offers no guarantees, often causing performance variability.
  • Congestion: Concurrent AI jobs and large-scale parameter updates lead to head-of-line blocking, congestion, and unpredictable packet drops.

Fabric-Scheduled Ethernet for AI

Fabric-scheduled Ethernet transforms Ethernet into a predictable, lossless, scalable fabric – ideal for AI. It uses cell spraying and virtual output queuing (VOQ) to build a scheduled fabric that delivers high performance while retaining Ethernet’s openness and cost benefits.

How It Works: Cell Spraying + VOQ = Scheduling

Cell Spraying: Load Distribution

Instead of sending large packets intact over a single path, DriveNets’ Network Cloud-AI breaks data into fixed-size cells and sprays them across multiple paths, as sketched below. This avoids overloading any single link, even during bursts, and defuses the “elephant flows” that often choke traditional Ethernet.

Benefits of cell spraying:

  • Smooths out traffic peaks through uniform, cell-level load balancing
  • Ensures predictable latency
  • Avoids congestion hotspots
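
To make the mechanism concrete, here is a minimal Python sketch of the cell-spraying idea. The cell size, link names, and spray/reassemble functions are illustrative assumptions, not DriveNets parameters: the point is only that a large payload is chopped into fixed-size cells, tagged with sequence numbers, spread evenly across every available fabric link, and reordered at the egress.

    # Minimal sketch of cell spraying (assumed cell size and link names;
    # not DriveNets' implementation).
    from itertools import cycle

    CELL_SIZE = 256  # bytes per fixed-size cell (assumed value)

    def spray(payload: bytes, links: list[str]) -> list[tuple[str, int, bytes]]:
        """Split a payload into fixed-size cells and distribute them
        evenly across all available fabric links."""
        cells = [payload[i:i + CELL_SIZE]
                 for i in range(0, len(payload), CELL_SIZE)]
        link = cycle(links)
        # Sequence numbers let the egress rebuild the payload
        # regardless of per-link arrival order.
        return [(next(link), seq, cell) for seq, cell in enumerate(cells)]

    def reassemble(cells: list[tuple[str, int, bytes]]) -> bytes:
        """Reorder cells by sequence number and rebuild the payload."""
        return b"".join(c for _, _, c in sorted(cells, key=lambda c: c[1]))

    # A 1 KB burst is spread over four links, so no single link
    # carries the whole "elephant".
    payload = bytes(1024)
    assert reassemble(spray(payload, ["link0", "link1", "link2", "link3"])) == payload

Because every link carries an equal share of cells, no single path can become a hotspot, regardless of how large any individual flow is.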

Virtual Output Queuing (VOQ): No More Head-of-Line Blocking

In traditional Ethernet switches, one congested port can block others, wasting bandwidth. VOQ fixes this by assigning a dedicated queue for each output port at each ingress port.

This ensures traffic is queued exactly where needed. The scheduler can then make intelligent, per-destination forwarding decisions. Combined with cell spraying, this guarantees fairness and isolation between traffic flows — critical for synchronized AI workloads.
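
The toy model below shows why per-destination queues remove head-of-line blocking; the IngressPort class and its methods are hypothetical, not a real switch API. With one queue per egress port, a backlog toward one destination never delays traffic headed elsewhere.

    from collections import deque

    class IngressPort:
        """Toy VOQ model: one queue per egress port, so a congested
        destination cannot block traffic bound for other destinations."""

        def __init__(self, num_egress_ports: int):
            self.voqs = [deque() for _ in range(num_egress_ports)]

        def enqueue(self, packet: bytes, egress_port: int) -> None:
            self.voqs[egress_port].append(packet)

        def dequeue_for(self, egress_port: int):
            # The scheduler pulls only from queues whose egress is ready.
            q = self.voqs[egress_port]
            return q.popleft() if q else None

    # Egress port 0 is congested, yet traffic to port 1 still flows.
    ingress = IngressPort(num_egress_ports=2)
    ingress.enqueue(b"to-congested-port", egress_port=0)
    ingress.enqueue(b"to-free-port", egress_port=1)
    print(ingress.dequeue_for(1))  # b'to-free-port', despite port 0's backlog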

End-to-End VOQ: Traffic Consistency

End-to-end VOQ provides consistent service across the network. Each virtual queue corresponds to a specific traffic flow, and packets are transmitted only when delivery is guaranteed.

A credit-based flow-control mechanism ensures queues don’t overflow: the receiving side grants credits that reflect its available buffer space, and the source consumes one credit for each packet it sends. New credits are returned as buffers drain, so a sender can never outrun the receiver. This prevents packet loss and ensures fair access, even under congestion.
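
A minimal sketch of such a credit loop, under the simplifying assumption of one credit per packet and a fixed receive-buffer depth (the CreditLink class is illustrative, not DriveNets’ implementation):

    class CreditLink:
        """Toy credit-based flow control: the sender transmits only while
        it holds credits, so the receive buffer can never overflow."""

        def __init__(self, receiver_buffer_slots: int):
            self.credits = receiver_buffer_slots  # initial grant = buffer depth
            self.receive_buffer: list[bytes] = []

        def send(self, packet: bytes) -> bool:
            if self.credits == 0:
                return False                      # pause: wait for a credit
            self.credits -= 1
            self.receive_buffer.append(packet)
            return True

        def drain_one(self) -> None:
            # The receiver consumes a packet and returns a credit.
            if self.receive_buffer:
                self.receive_buffer.pop(0)
                self.credits += 1

    link = CreditLink(receiver_buffer_slots=2)
    assert link.send(b"p1") and link.send(b"p2")
    assert not link.send(b"p3")  # out of credits: paused, not dropped
    link.drain_one()
    assert link.send(b"p3")      # credit returned, sending resumes

The design point worth noting is that exhausted credits pause the sender rather than drop packets, which is precisely what makes the fabric lossless.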

Scheduled Fabric: Lossless Ethernet for AI

At the core of Network Cloud-AI is a scheduled fabric built on DriveNets’ Distributed Disaggregated Chassis architecture, enabling centralized control and data scheduling.

Rather than relying on reactive congestion controls like ECN or PFC, DriveNets proactively calculates optimal transmission schedules. Each cell knows precisely when and where to go — enabling deterministic, lossless transport.
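
To illustrate what proactively calculating a schedule can mean, here is a greedy sketch that assigns each cell to the first timeslot in which both its ingress and egress ports are free, so no two cells ever contend for the same port in the same slot. The algorithm and its inputs are assumptions for illustration only; DriveNets’ actual scheduler is not described publicly.

    def schedule(demands: list[tuple[int, int]], num_slots: int):
        """Greedy timeslot assignment: each (ingress, egress) cell gets the
        earliest slot where both ports are idle, so nothing collides."""
        ingress_busy: dict[int, set[int]] = {}
        egress_busy: dict[int, set[int]] = {}
        plan = []
        for ingress, egress in demands:
            for slot in range(num_slots):
                if (ingress not in ingress_busy.setdefault(slot, set())
                        and egress not in egress_busy.setdefault(slot, set())):
                    ingress_busy[slot].add(ingress)
                    egress_busy[slot].add(egress)
                    plan.append((slot, ingress, egress))
                    break
        return plan

    # Two cells target the same egress port: the scheduler separates them
    # in time instead of letting them collide and drop.
    print(schedule([(0, 2), (1, 2), (1, 3)], num_slots=4))
    # [(0, 0, 2), (1, 1, 2), (0, 1, 3)]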

Why It Matters for AI

AI training performance scales linearly only when the network keeps pace with GPU compute. Network Cloud-AI eliminates the delays and inconsistencies that slow training.

Results:

  • Higher GPU utilization
  • Faster training and reduced cost
  • Seamless scaling to thousands of GPUs

Crucially, this is all built on standard Ethernet hardware — avoiding vendor lock-in and high proprietary costs.

Highest-Performance Ethernet for AI

DriveNets Network Cloud-AI redefines Ethernet for the AI era. By combining cell spraying, VOQ, and fabric scheduling, it delivers the deterministic, lossless performance required for high-end HPC and AI networks — all while preserving Ethernet’s openness and flexibility.

Learn more in our upcoming webinar: Insights from deploying an Ethernet-based GPU cluster fabric
