AI Programming Model

Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.

NVIDIA

AI Infrastructure

GPU Clusters

Data Center

AI Training

AI Networking

📟 `CUDA` (Compute Unified Device Architecture)

Parallel computing platform enabling GPU programming.

- Thousands of parallel threads
- Native C/C++ support, with Python available through bindings such as Numba and CuPy
- General-purpose GPU computing

CUDA parallel model:

- Break the problem into small, identical tasks
- Launch thousands of threads (workers) to do them simultaneously
- Collect the results when every thread has finished
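The three steps of the parallel model can be sketched on the CPU. This is only an illustration of the pattern, not CUDA: a Python thread pool stands in for GPU threads, and the names `double` and `run_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def double(x):
    """The small, identical task applied to every element."""
    return x * 2


def run_in_parallel(values, workers=8):
    """Hand one task per element to a pool of workers and collect results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits every task, then yields the results in input order
        return list(pool.map(double, values))


print(run_in_parallel(range(10)))
```

On a GPU the same idea scales to thousands of hardware threads, and the "collect" step corresponds to waiting for the kernel to finish before reading the output.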

Before CUDA, programmers had to use graphics APIs (OpenGL, DirectX) to leverage GPU power for non-graphics tasks, which was complex and inefficient. CUDA provided a more direct and flexible way to program GPUs for general-purpose computing.

graph TD
    A[CPU] -->| Graphics API ✨ | B[GPU 🧮]
    B -->|Limited Access| C[Data Processing]
    C -->|Results| A

With CUDA, developers can write code that runs on the GPU, allowing for significant performance improvements in tasks that can be parallelized, such as matrix operations, simulations, and deep learning workloads.

graph TD
    A[CPU] -->|CUDA API 📟| B[GPU 🧮]
    B -->|Parallel Threads 🪡| C[Data Processing]
    C -->|Results| A

🪡 Kernels and Threads

A kernel is the unit of CUDA code that programmers typically write and compose, akin to a procedure or function in languages targeting CPUs.

- A kernel is a function that runs on the GPU, executed by many threads in parallel.
- Each thread executes the kernel with a unique thread ID, allowing for data parallelism.

__global__ void myKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Global thread ID across all blocks
    if (idx < n) {                                   // Skip threads past the end of the array
        data[idx] = data[idx] * 2;                   // Example operation: double each element
    }
}
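To see how each thread ends up with its own element, the kernel can be emulated sequentially on the CPU. The sketch below is plain Python rather than CUDA; the `launch` helper and the bounds guard inside `my_kernel` are assumptions added for illustration.

```python
def my_kernel(data, block_idx, block_dim, thread_idx):
    """Body of one emulated thread, using the same index arithmetic as the CUDA kernel."""
    idx = block_idx * block_dim + thread_idx  # global thread ID
    if idx < len(data):                       # assumed guard: surplus threads do nothing
        data[idx] = data[idx] * 2


def launch(kernel, grid_dim, block_dim, data):
    """Sequential stand-in for a launch of grid_dim blocks with block_dim threads each."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(data, block_idx, block_dim, thread_idx)


data = list(range(10))
launch(my_kernel, grid_dim=3, block_dim=4, data=data)  # 12 threads cover 10 elements
print(data)
```

Three blocks of four threads produce twelve thread IDs for ten elements, so the last two threads fall outside the array. That is why a guard on the index is commonly added to kernels of this shape.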

CUDA provides a hierarchical execution model where threads are organized into blocks, and blocks are organized into a grid. This allows for scalable parallelism across a wide range of GPU architectures.
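As a rough sketch of how the grid and block sizes relate to the problem size, the helpers below compute how many blocks are needed to cover `n` elements and map a global thread ID back to its block and thread coordinates. The function names and the block size of 256 are assumptions chosen for illustration.

```python
def blocks_needed(n, threads_per_block=256):
    """Ceiling division: the smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def thread_coords(global_idx, threads_per_block=256):
    """Map a global thread ID back to (block index, thread index within the block)."""
    return divmod(global_idx, threads_per_block)


n = 1000
grid = blocks_needed(n)
print(grid, grid * 256 - n)  # 4 blocks, 24 idle threads in the last block
print(thread_coords(600))    # (2, 88)
```

Because the grid is just a count of independent blocks, the same launch can be spread over few or many SMs, which is what lets one program scale across GPU sizes.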

1. 🧮 Graphics Processing Clusters (`GPCs`)

Roughly analogous to the cores of a CPU, or to a workshop that takes on its share of the overall workload.

GPUs are organized into Graphics Processing Clusters (GPCs), which contain multiple Streaming Multiprocessors (SMs) and other components.
GPCs manage the distribution of work across SMs and handle tasks such as scheduling and memory management, ensuring efficient execution of parallel workloads on the GPU.

Example: the NVIDIA GeForce GTX 1080 has 4 GPCs, each containing 5 SMs with 128 CUDA cores per SM, for a total of 2560 CUDA cores.

Each GPC contains several SMs
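The GTX 1080 figure follows directly from multiplying through the hierarchy. A minimal sketch, with `total_cuda_cores` as a hypothetical helper name:

```python
def total_cuda_cores(gpcs, sms_per_gpc, cores_per_sm):
    """Multiply through the hierarchy: GPCs -> SMs per GPC -> CUDA cores per SM."""
    return gpcs * sms_per_gpc * cores_per_sm


# GTX 1080 figures quoted above: 4 GPCs x 5 SMs x 128 CUDA cores
print(total_cuda_cores(gpcs=4, sms_per_gpc=5, cores_per_sm=128))  # 2560
```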

2. 🥓 Streaming Multiprocessors (`SMs`)

Highly parallel processing units within the GPU that execute threads in warps.

SMs manage the execution of threads, including scheduling, synchronization, and memory access, allowing for efficient parallel processing of tasks on the GPU.

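Each SM hosts multiple resident warps and, on RTX-class GPUs, one Ray Tracing Unit (RTU, which NVIDIA calls an RT Core) for ray tracing workloads. Data center GPUs such as the A100 and H100 do not include RT Cores.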

2.1 🔆 Ray Tracing Units (RTUs)

Specialized hardware for accelerating ray tracing workloads, which are common in graphics rendering and increasingly used in AI applications for tasks like 3D modeling and simulation.

RTUs can perform ray tracing calculations much faster than traditional CUDA cores, enabling real-time ray tracing in graphics applications and accelerating ray tracing-based algorithms in AI workloads.
RTUs are designed to handle the core calculations of ray tracing, such as bounding volume hierarchy (BVH) traversal and ray-triangle intersection tests, which are computationally intensive when performed on general-purpose CUDA cores. Shading itself still runs on the SM's regular cores.

2.2 🧶 Warps

A group of 32 threads scheduled and executed together on an SM, in lockstep.

Threads in a warp execute the same instruction simultaneously. NVIDIA calls this model Single Instruction, Multiple Threads (SIMT); it is closely related to Single Instruction, Multiple Data (SIMD).
- Divergence (threads in one warp taking different branches) causes serialization and performance loss
- Design algorithms to minimize divergence for optimal performance

The number of CUDA Cores and Tensor Cores per SM is not fixed at "32 + 1". It varies by GPU architecture (for example, Volta has 8 Tensor Cores per SM, while Ampere and Hopper have 4), and a warp's 32 threads are scheduled onto whichever cores the SM's partition provides.
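The cost of divergence can be illustrated with a deliberately simplified model: assume that each distinct branch path taken inside a warp has to run as its own pass while the other threads wait. Real hardware scheduling is more involved, and `serialized_passes` is a hypothetical helper, so treat this as a sketch.

```python
WARP_SIZE = 32


def serialized_passes(branch_of_thread):
    """For each warp, count the distinct branch paths that must run one after another."""
    passes = []
    for start in range(0, len(branch_of_thread), WARP_SIZE):
        warp = branch_of_thread[start:start + WARP_SIZE]
        passes.append(len(set(warp)))
    return passes


# 64 threads form 2 warps. Branching on thread parity splits every warp in two.
by_parity = [tid % 2 for tid in range(64)]
# Branching on the warp index keeps each warp on a single path.
by_warp = [tid // WARP_SIZE for tid in range(64)]

print(serialized_passes(by_parity))  # [2, 2]
print(serialized_passes(by_warp))    # [1, 1]
```

Both layouts send half of the threads down each branch, but only the second keeps all 32 threads of a warp together, which is the kind of arrangement the advice on minimizing divergence is aiming for.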

2.2.1 🪡 CUDA Cores

General-purpose cores for executing a wide range of parallel tasks, including graphics rendering and general computing workloads.

Perform standard floating-point and integer operations, making them suitable for a variety of applications beyond graphics, such as scientific computing, machine learning, and data processing.

- Addition, multiplication
- Bitwise operations
- Fused multiply-add (FMA) on scalar values:

$d = a \times b + c$

2.2.2 🧵 Tensor Cores

Specialized cores designed for accelerating matrix operations, particularly in deep learning workloads.

Perform mixed-precision matrix multiply and accumulate operations, which are common in deep learning algorithms.

$D = A \times B + C$

Where A, B, C, and D are matrices. Tensor Cores can perform this operation much faster than traditional CUDA cores, especially when using lower precision formats like FP16 or INT8, which are often sufficient for deep learning tasks.
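The arithmetic behind a mixed-precision multiply-accumulate can be emulated on the CPU. In this sketch the inputs are rounded to FP16 with the standard `struct` module, and an ordinary Python float stands in for the wider accumulator; the helper names `to_fp16` and `mma` are assumptions, and nothing here runs on actual Tensor Cores.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mma(a, b, c):
    """Return D = A x B + C with FP16-rounded inputs and a wider accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[c[i][j] for j in range(cols)] for i in range(rows)]  # start from C
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # Low-precision inputs, products summed at higher precision
                d[i][j] += to_fp16(a[i][k]) * to_fp16(b[k][j])
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(mma(a, b, c))  # [[20.0, 23.0], [44.0, 51.0]]
```

Small integers are exactly representable in FP16, so this result is exact. A value such as 0.1 is not, which is why the accumulation is kept at higher precision than the inputs.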

 GPCs 🧮
  |--> 🥓 SMs
     |--> 🔆 Ray Tracing Units (RTUs)
     |--> 🧶 Warps
         |--> 🪡 CUDA Cores
         |--> 🧵 Tensor Cores

cuDNN (CUDA Deep Neural Network library)

GPU-accelerated library for deep learning primitives.

Provides highly tuned implementations for standard routines such as:

- forward and backward convolution
- attention
- matmul
- pooling
- normalization
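To make one of these primitives concrete, here is a naive CPU reference for what a max pooling routine computes. It is not the cuDNN API and makes no attempt at performance; `max_pool_2d` is a hypothetical name, and the window is assumed to be square and non-overlapping.

```python
def max_pool_2d(image, size=2):
    """Non-overlapping max pooling over size x size windows (stride equals size)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 4, 8],
]
print(max_pool_2d(image))  # [[4, 5], [6, 8]]
```

A library such as cuDNN provides tuned GPU implementations of routines like this one, so frameworks do not have to write and optimize them per architecture.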

Written by Hitesh Sahu, a passionate developer and blogger.

Fri Feb 27 2026
