
AI Infra Computing: GPU, DPU, Virtualization, DGX Systems

Comprehensive overview of modern AI infrastructure covering CPU, GPU, and DPU architectures, accelerated computing models, cluster scaling, high-speed networking (InfiniBand and RoCE), storage integration, and power and cooling considerations for AI data centers.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026


Computation

1. CPU (Central Processing Unit)

  • Few powerful cores (4–64+)
  • Low-latency decision-making
  • Optimized for sequential tasks
  • Handles OS, orchestration, control logic

Best for:
Operating systems, control plane, general-purpose apps

2. DPU (Data Processing Unit)

Offloads, isolates, and accelerates infrastructure tasks

  • A specialized processor that handles data-centric infrastructure tasks, freeing CPU/GPU resources for AI workloads.
  • Offloads networking, storage, and security:
    • packet processing
    • firewalling
    • encryption offload
  • Enables zero-trust isolation between tenant workloads and infrastructure

Best for:
Infrastructure offload, zero-trust environments

NVIDIA BlueField Platform

  • The BlueField architecture combines a NIC subsystem (based on ConnectX) with a programmable data path, hardware accelerators for cryptography, compression, and regular expressions, and an Arm core complex for the control plane.

DOCA: NVIDIA BlueField DPU Software Framework

Much like CUDA abstracts GPU programming, DOCA abstracts DPU programming to a higher level.

3. GPU (Graphics Processing Unit)

  • Thousands of simple cores
  • Massive parallelism
  • High memory bandwidth and high throughput
  • AI training & inference

Best for:
AI training, inference, simulations, rendering
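As a toy analogy for the two designs, here is the same elementwise job done once by a single sequential worker (CPU-style, low latency) and once split across many simple workers (GPU-style, high throughput). This is plain Python with illustrative worker counts, not real GPU programming.

```python
# Toy sketch of latency-oriented vs throughput-oriented execution.
# Worker counts are illustrative, not real hardware figures.
from concurrent.futures import ThreadPoolExecutor

def vector_add(a, b):
    """CPU-style: one sequential stream of operations."""
    return [x + y for x, y in zip(a, b)]

def vector_add_parallel(a, b, workers=8):
    """GPU-style: the same elementwise work split across many workers."""
    n = len(a)
    chunk = (n + workers - 1) // workers
    def add_chunk(start):
        return [a[j] + b[j] for j in range(start, min(start + chunk, n))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(add_chunk, range(0, n, chunk))
    return [x for part in parts for x in part]

a, b = list(range(1000)), list(range(1000))
assert vector_add(a, b) == vector_add_parallel(a, b)
```

Both paths compute the same result; the point is that elementwise work has no sequential dependency, which is exactly what thousands of simple GPU cores exploit.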

NVIDIA-Certified Systems:

  • Validates best baseline configurations for AI workloads
  • Validated for: manageability, scalability, performance, security

Virtualization – MIG vs vGPU

vGPU (Virtual GPU)

vGPU → VM environments

  • Software-based partition
  • Hypervisor controlled
  • Up to 64 partitions (depends on GPU)
  • Used in VMs, VDI

MIG (Multi-Instance GPU)

MIG → containers / bare-metal

  • Hardware-level partition
  • Max 7 instances
  • Predictable performance
  • Used for AI workloads

If the question mentions:

  • Strong isolation
  • Bare-metal Linux
  • Multi-tenant AI training

Answer → MIG.
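The MIG slicing rules can be sketched as simple arithmetic. Profile names follow NVIDIA's `<g>g.<mem>gb` convention; the slice and memory figures below are the commonly documented values for an 80 GB A100-class part and should be treated as illustrative assumptions, not an official spec.

```python
# Sketch of MIG partition arithmetic for an 80 GB, 7-slice GPU (A100-class).
# Profile table values are illustrative assumptions.
MIG_PROFILES = {            # profile: (compute slices out of 7, memory in GB)
    "1g.10gb": (1, 10),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "7g.80gb": (7, 80),
}

def max_instances(profile, total_slices=7, total_mem_gb=80):
    """How many instances of a given profile fit on one GPU."""
    slices, mem_gb = MIG_PROFILES[profile]
    return min(total_slices // slices, total_mem_gb // mem_gb)

print(max_instances("1g.10gb"))  # the "max 7 instances" hardware limit
print(max_instances("3g.40gb"))
```

This is why MIG tops out at 7 instances: the smallest profile takes one of seven compute slices, and larger profiles consume slices and memory proportionally.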

GPU Architecture

Core Types

1. CUDA Cores

  • General-purpose parallel computation
  • Graphics + AI workloads

2. Tensor Cores

  • Accelerate AI matrix operations
  • Mixed precision compute
  • Used in deep learning training/inference

3. RT Cores

  • Ray tracing acceleration
  • Graphics realism
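The mixed-precision idea behind Tensor Cores (FP16 storage, wider-precision accumulation) can be illustrated in pure Python using `struct`'s `'e'` format, which rounds a float to IEEE half precision. This is a numerical sketch of the motivation, not Tensor Core code.

```python
# Why mixed precision: an accumulator kept in FP16 eventually stalls,
# because its value spacing grows past the size of each addend.
import struct

def to_fp16(x):
    """Round x to the nearest IEEE-754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

fp16_acc = 0.0   # accumulator rounded to FP16 after every add
fp32_acc = 0.0   # accumulator kept in wider precision
step = to_fp16(0.1)
for _ in range(5000):
    fp16_acc = to_fp16(fp16_acc + step)
    fp32_acc += step

# The FP16 accumulator stalls well short of the true sum (~500);
# the wide accumulator stays close to it.
print(fp16_acc, round(fp32_acc, 1))
```

Tensor Cores sidestep this by multiplying low-precision inputs but accumulating the products in FP32, keeping throughput high without the stalling error above.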

GPU Architecture based on Workload

  1. Blackwell GPU (B200/B300) - AI LLM training and inference, GenAI, AI reasoning
  2. Hopper GPU (H100, H200) - Data analytics, conversational AI, language processing
  3. Ada Lovelace GPU (L40S, RTX 40 series) - Gaming, ray tracing, visualization, AI-powered graphics
  4. Grace CPU - Arm CPU designed by NVIDIA for AI workloads, paired with Hopper or Blackwell GPUs in Superchip designs
    • Grace Hopper Superchip (GH200): Combines Grace CPU and Hopper GPU for high-performance AI compute
    • Grace Blackwell Superchip (GB200/GB300): Combines Grace CPU and Blackwell GPU for AI training and inference
    • NVIDIA Grace CPU Superchip: 2x Grace CPUs for CPU-only workloads, optimized for AI data processing and analytics

GPU Families (Compute Perspective)

1. RTX Series

  • Workstations and visualization
  • Not ideal for massive AI training

2. Data Center GPUs

  • V100, A100, H100, H200
  • Designed for AI, HPC, large-scale compute

NVIDIA DGX (Deep Learning System) Compute Platforms

Ready-to-use platforms designed specifically for AI, ML, and DL workloads.

DGX OS

NVIDIA's customized Ubuntu 22.04-based operating system for DGX Systems.

  • NVIDIA-Optimized Kernel for AI, ML, and analytics workloads on DGX systems.
  • Latest version: DGX OS 7.2.1, released August 19, 2025
  • Comprehensive Software Stack: Includes NVIDIA GPU drivers, CUDA Toolkit, cuDNN, NCCL, DCGM, Docker Engine, and more.
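The driver stack bundled with DGX OS exposes GPU inventory through `nvidia-smi`, e.g. `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader`. A small sketch of parsing that CSV output; the sample string below is assumed example output, not captured from a real system.

```python
# Parse the CSV emitted by:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
# The sample text is an assumed example of that output format.
import csv
import io

sample = (
    "NVIDIA A100-SXM4-80GB, 81920 MiB\n"
    "NVIDIA A100-SXM4-80GB, 81920 MiB\n"
)

def parse_gpus(text):
    """Return one dict per GPU line of nvidia-smi CSV output."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [{"name": name, "memory": mem} for name, mem in rows]

for gpu in parse_gpus(sample):
    print(gpu["name"], gpu["memory"])
```

In practice the same parsing feeds monitoring tools; on DGX systems DCGM provides a richer, supported telemetry API for the same data.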

NVIDIA DGX A100

Legacy DGX system for AI training and inference

  • 8x A100 GPUs
  • 200 Gb/s networking
  • ~6.5 kW power

DGX Spark

Desktop-class DGX for AI development and prototyping

  • Compact form factor (comparable to a Mac mini), built on the GB10 Grace Blackwell Superchip

NVIDIA DGX H100/H200/B200/B300 Systems

  • 8x H100/H200 (or B200/B300) GPUs
  • 400 Gb/s networking
  • ~10.2 kW power
  • More memory than DGX A100

NVIDIA DGX SuperPOD

Cluster of DGX systems, scaled across racks

  • Built from multiple DGX nodes (8x GPUs each)
  • Exascale-class AI compute
  • Foundation model training

NVIDIA DGX GB200/GB300

  • 36x Grace CPUs + 72x Blackwell GPUs per rack (B200 in GB200, B300 in GB300)
  • 130 TB/s NVLink interconnect
  • Liquid cooling
  • Designed for large-scale AI training and inference workloads

Typical question:

Which solution is suitable for training massive LLMs across multiple racks? Answer: DGX SuperPOD.
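Back-of-envelope sizing for such a cluster follows directly from the per-node figures quoted above (8 GPUs and ~10.2 kW per DGX H100 node). The 32-node count below is a commonly cited SuperPOD scalable-unit size and is an assumption for illustration.

```python
# Back-of-envelope cluster sizing from per-node DGX H100 figures.
# The 32-node scalable unit is an illustrative assumption.
def cluster_totals(nodes, gpus_per_node=8, kw_per_node=10.2):
    """Total GPU count and power draw for a cluster of DGX nodes."""
    return {
        "gpus": nodes * gpus_per_node,
        "power_kw": round(nodes * kw_per_node, 1),
    }

print(cluster_totals(32))  # one scalable unit: 256 GPUs
```

Arithmetic like this is why power and cooling are first-class design constraints for AI data centers: a single scalable unit already draws hundreds of kilowatts.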


NVIDIA AI Platform

Cloud providers offer pre-configured environments for AI workloads, including the NVIDIA VMI (Virtual Machine Image) and the NVIDIA AI Enterprise software suite.

1. Accelerated Infrastructure Layer

1.1 NVIDIA VMI (Virtual Machine Image)

  • Pre-configured VM images with NVIDIA drivers, CUDA, cuDNN, and AI frameworks
  • Available on AWS, Azure, Google Cloud

2. AI Platform Software Layer

2.1 NVIDIA AI Enterprise

Software suite for AI workloads in enterprise environments

  • Includes AI frameworks, tools, and support for NVIDIA GPUs
  • 50+ AI frameworks and tools, including TensorFlow, PyTorch, RAPIDS
  • Pretrained models and SDKs for computer vision, NLP, recommendation systems
  • Optimized for VMware vSphere and NVIDIA-Certified Systems

3. NVIDIA AI Cloud Services

  • Managed services for AI workloads, including model training, deployment, and monitoring on NVIDIA GPUs in the cloud

NVIDIA DGX Cloud

A cloud-based platform for deploying and managing AI applications on NVIDIA GPUs, with support for Kubernetes and containerized workloads.

  • Provides access to NVIDIA GPU resources in the cloud for AI training and inference


NVIDIA AI Foundry

A cloud-based service for building, training, and deploying custom AI models using NVIDIA GPUs, with support for popular frameworks and tools.

  • Enables enterprises to combine their data, accelerated computing, and software tools to create and deploy custom models for their generative AI initiatives

NVIDIA AI Foundation

A collection of pretrained models, SDKs, and tools for accelerating AI development across various domains

  • Provides access to pretrained models for computer vision, NLP, recommendation systems, and more
  • Includes SDKs for integrating these models into applications

NVIDIA NeMo

A comprehensive software suite to build, monitor, and optimize AI agents across their lifecycle at enterprise scale.

It provides microservices and toolkits for:

  • data processing
  • model fine-tuning and evaluation
  • reinforcement learning
  • policy enforcement
  • system observability

