AI Infra Networking: GPU Clusters, InfiniBand, RoCE, and DPU Integration

Fundamental concepts and technologies for networking in AI-centric data centers, including GPU interconnects (NVLink, NVSwitch), high-speed networking (InfiniBand, RoCE), and the role of DPUs (Data Processing Units) in accelerating AI workloads and managing network traffic.

Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 19 2026

Networking in an AI-Centric Data Center

AI workloads require:

  • Ultra-low latency
  • High bandwidth
  • Deterministic performance
  • Scalability across nodes

Networking must support:

  • GPU-to-GPU communication
  • Storage access
  • Cluster management
  • Infrastructure monitoring

Distributed training requires:

  • High bandwidth
  • Low latency
  • Efficient collective communication

Technologies commonly used to meet these requirements:

  • NCCL (NVIDIA Collective Communications Library)
  • RDMA
  • InfiniBand
  • NVLink
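To make "collective communication" concrete, here is a minimal pure-Python simulation of a ring all-reduce, one common way to sum gradients across GPUs. It is a teaching sketch, not NCCL's implementation: the function name `ring_all_reduce` is hypothetical, ranks are plain lists, and each rank is assumed to hold exactly as many values as there are ranks so that every chunk is a single number.

```python
def ring_all_reduce(rank_data):
    """Simulate a ring all-reduce (sum) over per-rank vectors.

    Assumption for this sketch: n ranks, each holding n values,
    so each "chunk" is one element of the vector.
    """
    n = len(rank_data)
    if any(len(vec) != n for vec in rank_data):
        raise ValueError("this sketch expects n ranks with n values each")
    data = [list(vec) for vec in rank_data]  # work on copies

    # Phase 1: reduce-scatter. In each step every rank sends one chunk
    # to its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Build all messages first so that all ranks send "simultaneously".
        messages = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for sender, (chunk, value) in enumerate(messages):
            receiver = (sender + 1) % n
            data[receiver][chunk] += value

    # Phase 2: all-gather. Each rank now owns one fully reduced chunk
    # and passes completed chunks around the ring.
    for step in range(n - 1):
        messages = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for sender, (chunk, value) in enumerate(messages):
            receiver = (sender + 1) % n
            data[receiver][chunk] = value

    return data


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(gradients))
```

Each rank sends 2(n-1) chunks in total, about twice its vector size regardless of the number of ranks, which is why per-link bandwidth matters so much for this pattern.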

Latency

Time taken for a single data transfer.

Important for:

  • Real-time inference
  • Synchronization

Throughput

Total data transferred per second.

Important for:

  • Large distributed training
  • Checkpointing
  • Dataset streaming
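A simple first-order model shows why latency dominates small transfers and throughput dominates large ones: transfer time is roughly latency plus size divided by bandwidth. The sketch below uses hypothetical link figures (2 µs latency, 50 GB/s bandwidth) chosen only for illustration; it ignores protocol overhead and congestion.

```python
def transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate: fixed latency plus serialization time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_fraction(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the total transfer time that is spent on latency."""
    total = transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Hypothetical link parameters, for illustration only.
LATENCY_S = 2e-6      # 2 microseconds
BANDWIDTH = 50e9      # 50 GB/s

small_sync_message = 64    # bytes, e.g. a synchronization flag
large_checkpoint = 1e9     # bytes, e.g. a 1 GB checkpoint shard

print(f"small message: {latency_fraction(small_sync_message, LATENCY_S, BANDWIDTH):.4f} of time is latency")
print(f"large transfer: {latency_fraction(large_checkpoint, LATENCY_S, BANDWIDTH):.6f} of time is latency")
```

Under these assumptions the small message is almost entirely latency-bound, while the checkpoint transfer is almost entirely bandwidth-bound.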

Network Separation

AI data centers use separate network planes.

1. Compute Network

  • GPU-to-GPU communication
  • Used for training & distributed workloads
  • Technologies:
    • InfiniBand
    • RoCE (RDMA over Converged Ethernet)
    • NVLink (inside node)
  • Priority: Ultra-low latency & high throughput

2. In-Band Management Network

  • Carries control and management traffic rather than training data
  • Lower bandwidth, higher latency than compute fabric
  • Critical for cluster operations and monitoring
  • Typical traffic:
    • SSH
    • Job scheduling (Slurm)
    • Kubernetes traffic
    • DNS, cluster APIs
  • Priority: Reliability and availability

3. Out-of-Band Management Network

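  • Remains available even when the server is powered off or its OS is unresponsive, because it talks to the baseboard management controller (BMC)
  • Remote power control
  • Remote console and hardware management (IPMI, Redfish)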
  • Priority: Always-on access for management and recovery
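As an illustration of remote power control over Redfish, the sketch below only builds the request that a client would POST to a BMC; it does not send anything. The `ComputerSystem.Reset` action and the `ResetType` field come from the Redfish standard, but the helper name `build_redfish_reset`, the host `bmc.example.internal`, and the system ID `1` are hypothetical, and the reset types a given BMC accepts vary by vendor.

```python
import json

# Reset types defined by the Redfish schema; a real BMC may support only a subset.
KNOWN_RESET_TYPES = {
    "On", "ForceOff", "GracefulShutdown", "GracefulRestart",
    "ForceRestart", "Nmi", "ForceOn", "PushPowerButton", "PowerCycle",
}


def build_redfish_reset(bmc_host, system_id, reset_type="ForceRestart"):
    """Return (url, json_body) for a Redfish ComputerSystem.Reset action."""
    if reset_type not in KNOWN_RESET_TYPES:
        raise ValueError(f"unknown ResetType: {reset_type}")
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type})
    return url, body


# Hypothetical BMC address and system ID, for illustration only.
url, body = build_redfish_reset("bmc.example.internal", "1")
print("POST", url)
print(body)
```

Because the request goes to the BMC on the out-of-band network, it can be issued even when the host operating system is down.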

4. Storage Network

  • High throughput for dataset access and checkpointing
  • Technologies:
    • NVMe-oF (NVMe over Fabrics)
    • Parallel file systems (Lustre, BeeGFS)
  • Often uses RDMA for low latency
  • Priority: High bandwidth and low contention

DMA (Direct Memory Access)

Lets devices such as NICs, GPUs, and storage controllers read and write memory directly, without the CPU copying the data.

  • Bypasses CPU for data transfer
  • Reduces latency
  • Increases throughput
  • Used in GPU interconnects and storage access
  • Enables GPUDirect for efficient data movement
  • Critical for high-performance AI workloads
  • Supports zero-copy transfers between GPU and network/storage

RDMA (Remote Direct Memory Access)

  • Direct memory access across servers, over the network
  • With GPUDirect RDMA, the remote access can target GPU memory directly

Traditional Networking

CPU handles:

  • Packet processing
  • Memory copying
  • Interrupts

RDMA

  • Bypasses CPU
  • Direct memory access across hosts
  • Reduces latency
  • Reduces CPU utilization
  • Increases throughput
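The difference between copying data and accessing it in place can be shown with a small Python analogy. This is not RDMA: it only contrasts a copy (`bytes`) with a zero-copy view (`memoryview`) of the same buffer, which is the idea RDMA applies across hosts. The function and payload names are hypothetical.

```python
def demo_copy_vs_view():
    """Contrast a copied buffer with a zero-copy view of the same memory."""
    payload = bytearray(b"gradient-shard-0")

    copied = bytes(payload)      # traditional path: data is duplicated
    view = memoryview(payload)   # zero-copy path: same underlying memory

    # The producer updates the buffer in place (same length, so the view stays valid).
    payload[0:8] = b"GRADIENT"

    # The copy is stale; the view reflects the update without any extra copy.
    return copied, bytes(view)


copied, viewed = demo_copy_vs_view()
print("copy sees:", copied)
print("view sees:", viewed)
```

In the traditional networking path the CPU performs copies like the first one at several layers; RDMA hardware instead reads and writes the registered buffer directly.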

InfiniBand vs Ethernet

1. Ethernet

  • General-purpose, widely used networking
  • Higher latency (~10–100 µs typical)
  • Typically uses the kernel TCP/IP stack
  • Commodity hardware
  • Widely supported

2. InfiniBand

High throughput and low latency with low CPU overhead, for connecting compute nodes and storage

  • Ultra-low latency (1–2 µs)
  • Uses a native RDMA stack to access remote memory directly without CPU involvement
  • Used in large HPC / AI clusters: reportedly over 50% of HPC clusters use InfiniBand
  • HCA (Host Channel Adapter, the InfiniBand network interface card): allows hardware offload of RDMA operations
  • Managed by a Subnet Manager (SM), for example OpenSM

Examples:

  • NVIDIA Quantum-X800 InfiniBand switch for high-performance InfiniBand-based AI clusters

3. RDMA over Converged Ethernet (RoCE)

RDMA + Ethernet: Enables RDMA over Ethernet

  • Open-standard alternative to InfiniBand that runs on Ethernet hardware
  • More flexible than InfiniBand
  • Cheaper and enterprise-friendly
  • Used in enterprise AI clusters

NVIDIA hardware:

  • Spectrum switches (Ethernet) + BlueField DPUs support RoCE for high-performance Ethernet-based AI clusters
  • For comparison, the InfiniBand counterpart is the NVIDIA Quantum-X800 InfiniBand switch for high-performance InfiniBand-based AI clusters

GPU Interconnects (Compute Fabric)

1. PCIe

  • Standard connection
  • Higher latency
  • Limited bandwidth (roughly 16–32 GB/s per direction for a Gen3 to Gen4 x16 link)
  • Not ideal for multi-GPU scaling

2. NVLink (chip-to-chip interconnect)

GPU to GPU inside same node → NVLink

  • High-speed GPU-to-GPU communication inside a server
  • Up to 600 GB/s per GPU on A100 (NVLink 3); newer generations are faster, e.g. 900 GB/s on H100 (NVLink 4)
  • Faster than PCIe
  • Enables multi-GPU scaling
  • Uses NVSwitch for scale
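A quick back-of-the-envelope calculation shows what the bandwidth gap means in practice. The sketch uses the upper figures quoted in this section (32 GB/s for PCIe, 600 GB/s for NVLink) and a hypothetical 10 GB payload; it ignores latency and protocol overhead, so treat the results as best-case estimates.

```python
GB = 1e9  # bytes


def seconds_to_move(size_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time at full link bandwidth (no latency, no overhead)."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical payload: 10 GB of gradients or activations.
payload = 10 * GB

pcie_s = seconds_to_move(payload, 32 * GB)      # PCIe upper figure from the text
nvlink_s = seconds_to_move(payload, 600 * GB)   # NVLink figure from the text

print(f"PCIe:   {pcie_s * 1000:.1f} ms")
print(f"NVLink: {nvlink_s * 1000:.1f} ms")
print(f"NVLink is {pcie_s / nvlink_s:.2f}x faster in this idealized model")
```

When a training step exchanges data of this size many times per second, the difference decides whether GPUs spend their time computing or waiting.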

3. NVSwitch Fabric

Connects multiple GPUs in large systems

  • Enables full bandwidth communication across large GPU arrays
  • Used in DGX H100 systems (the building blocks of a DGX SuperPOD) to connect 8× H100 GPUs inside each node
  • Provides non-blocking, high-speed interconnect for large multi-GPU systems

4. GPUDirect RDMA

GPU to GPU across nodes → GPUDirect RDMA

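  • Works across hosts
  • The NIC reads and writes GPU memory directly, so data moves GPU-to-NIC-to-GPU without staging in system memory
  • Used for data transfer in HPC and AI clusters
  • No CPU copies in the data path, which lowers latency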

5. GPUDirect Storage

Storage device ↔ GPU memory

  • Works within a host
  • High-throughput GPU data loading from NVMe/RAID/parallel storage
  • Avoids the system memory bottleneck by bypassing the CPU and the bounce buffer in system memory
  • High-bandwidth I/O for feeding large datasets into GPUs

GPU & Network Management with Kubernetes Operators

NVIDIA GPU Operator

Open source Kubernetes operator for managing NVIDIA GPU resources in Kubernetes clusters.

  • Simplifies GPU management in Kubernetes environments, enabling seamless deployment of AI workloads on GPU-accelerated Kubernetes clusters.

  • Sits on top of Kubernetes to ensure GPU resources are available and properly configured for AI workloads running in Kubernetes

  • Supports both on-prem and cloud Kubernetes clusters

  • Automates deployment and management of NVIDIA GPU drivers, device plugins, and monitoring components in Kubernetes

    • NVIDIA driver installation and updates
    • NVIDIA Container Toolkit, providing container runtime support for GPU-accelerated containers
    • NVIDIA Kubernetes Device Plugin for exposing GPU resources to containers
    • GPU resource monitoring and metrics collection
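Once the device plugin is running, GPUs appear to the scheduler as the extended resource `nvidia.com/gpu`, and a workload requests them in its resource limits. The sketch below builds such a Pod manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The helper name, pod name, and image are hypothetical placeholders; `nvidia-smi` is used as the command only to show that the GPU is visible inside the container.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

The pod needs no driver or runtime configuration of its own; that is the part the operator automates on each node.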

NVIDIA Network Operator

Kubernetes operator for managing NVIDIA network resources in Kubernetes clusters.

  • Manages NVIDIA network resources (InfiniBand, RoCE) in Kubernetes environments

  • Ensures high-performance networking for AI workloads in Kubernetes clusters

  • Works alongside the GPU Operator to provide comprehensive GPU + network management for AI workloads in Kubernetes

  • Network Operator Components:

    • MLNX_OFED drivers: Installs and manages Mellanox OFED drivers for InfiniBand/RoCE support
    • Kubernetes RDMA Shared Device Plugin: Exposes RDMA network resources to Kubernetes workloads
    • NVIDIA Peer Memory Driver (nvidia-peermem): Lets the NIC access GPU memory directly, enabling GPUDirect RDMA between GPUs on different nodes
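A workload that wants both a GPU and an RDMA device requests the two resources side by side. The sketch below builds such a Pod manifest as a Python dictionary. Several details are assumptions to verify against your cluster: the RDMA resource name is defined by the shared device plugin's configuration (`rdma/rdma_shared_device_a` is only an example name), the `IPC_LOCK` capability is commonly added so the application can pin memory for RDMA, and the helper name, pod name, and image are hypothetical.

```python
import json


def rdma_gpu_pod_manifest(name, image, rdma_resource, gpus=1):
    """Build a Pod manifest requesting GPUs plus one shared RDMA device."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Assumption: allows locking (pinning) memory for RDMA transfers.
                    "securityContext": {"capabilities": {"add": ["IPC_LOCK"]}},
                    "resources": {
                        "limits": {
                            "nvidia.com/gpu": gpus,
                            rdma_resource: 1,  # name depends on plugin configuration
                        }
                    },
                }
            ],
        },
    }


# Example values only; check the resource name advertised by your nodes.
manifest = rdma_gpu_pod_manifest(
    "rdma-gpu-test",
    "registry.example.com/nccl-tests:latest",
    rdma_resource="rdma/rdma_shared_device_a",
)
print(json.dumps(manifest, indent=2))
```

Running `kubectl describe node` on a worker shows which resource names the two operators actually advertise.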