AI Programming Model
Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.
📟 CUDA (Compute Unified Device Architecture)
Parallel computing platform enabling GPU programming.
- Thousands of parallel threads
- Native C/C++ integration, with Python support through bindings
- General-purpose GPU computing
CUDA parallel model:
- Break problem into small identical tasks
- Launch thousands of threads (workers) to do them simultaneously
- Collect results when everyone finishes
Before CUDA, programmers had to use graphics APIs (OpenGL, DirectX) to leverage GPU power for non-graphics tasks, which was complex and inefficient. CUDA provided a more direct and flexible way to program GPUs for general-purpose computing.
graph TD
A[CPU] -->| Graphics API ✨ | B[GPU 🧮]
B -->|Limited Access| C[Data Processing]
C -->|Results| A
With CUDA, developers can write code that runs on the GPU, allowing for significant performance improvements in tasks that can be parallelized, such as matrix operations, simulations, and deep learning workloads.
graph TD
A[CPU] -->|CUDA API 📟| B[GPU 🧮]
B -->|Parallel Threads 🪡| C[Data Processing]
C -->|Results| A
🪡 Kernels and Threads
A kernel is the unit of CUDA code that programmers typically write and compose, akin to a procedure or function in languages targeting CPUs.
- A kernel is a function that runs on the GPU, executed by many threads in parallel.
- Each thread executes the kernel with a unique thread ID, allowing for data parallelism.
__global__ void myKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // Calculate global thread ID
    if (idx < n) {                                   // Skip threads past the end of the array
        data[idx] = data[idx] * 2;                   // Example operation: double the element
    }
}
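A kernel only runs once the host launches it. The sketch below shows one plausible way to do that with the CUDA runtime API: copy data to the GPU, launch enough blocks to cover every element, and copy the results back. The names `doubleElements` and `doubleOnGpu`, the block size of 256, and the bounds check are assumptions made for illustration, not part of the original kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: doubles each element of `data`.
// The bounds check guards threads in the last block that fall past `n`.
__global__ void doubleElements(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    if (idx < n) {
        data[idx] *= 2;
    }
}

// Hypothetical host helper: copies `host` to the GPU, runs the kernel,
// and copies the result back. Returns false if a CUDA runtime call fails.
bool doubleOnGpu(std::vector<int> &host) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;
    const size_t bytes = n * sizeof(int);

    int *device = nullptr;
    if (cudaMalloc(&device, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(device, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threadsPerBlock = 256;  // assumed block size
        // Round up so that blocks * threadsPerBlock >= n.
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        doubleElements<<<blocks, threadsPerBlock>>>(device, n);
        // The device-to-host copy waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), device, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(device);
    return ok;
}
```

Because the launch rounds the block count up, some threads in the last block have no element to process, which is why the kernel checks `idx < n` before touching memory.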
- CUDA provides a hierarchical execution model where threads are organized into blocks, and blocks are organized into a grid. This allows for scalable parallelism across a wide range of GPU architectures.
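The grid, block, and thread hierarchy can also be laid out in two dimensions. The following sketch, which assumes a 16 x 16 block shape and uses the hypothetical names `recordBlockIds` and `blockIdMap`, has every thread write the linear ID of the block it belongs to, which makes the mapping from hierarchy to data visible.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread owns one (row, col) cell of a
// width x height matrix and records which block it belongs to.
__global__ void recordBlockIds(int *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        out[row * width + col] = blockIdx.y * gridDim.x + blockIdx.x;
    }
}

// Hypothetical host helper: returns the per-cell block IDs,
// or an empty vector if a CUDA runtime call fails.
std::vector<int> blockIdMap(int width, int height) {
    if (width <= 0 || height <= 0) return {};
    std::vector<int> host(static_cast<size_t>(width) * height, -1);
    const size_t bytes = host.size() * sizeof(int);

    int *device = nullptr;
    if (cudaMalloc(&device, bytes) != cudaSuccess) return {};

    dim3 block(16, 16);  // 256 threads per block (assumed shape)
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    recordBlockIds<<<grid, block>>>(device, width, height);

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host.data(), device, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(device);
    if (!ok) host.clear();
    return host;
}
```

For a 32 x 32 matrix this launches a 2 x 2 grid, so the top-left 16 x 16 tile holds block ID 0 and the bottom-right tile holds block ID 3.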
1. 🧮 Graphics Processing Clusters (GPCs)
Roughly analogous to a cluster of CPU cores, or to a workshop that takes on a share of the overall workload.
- GPUs are organized into Graphics Processing Clusters (GPCs), which contain multiple Streaming Multiprocessors (SMs) and other components.
- GPCs manage the distribution of work across SMs and handle tasks such as scheduling and memory management, ensuring efficient execution of parallel workloads on the GPU.
Example: the NVIDIA GeForce GTX 1080 has 4 GPCs, each containing 5 SMs with 128 CUDA cores per SM, for a total of 20 SMs and 2560 CUDA cores.
Each GPC contains several SMs.
2. 🥓 Streaming Multiprocessors (SMs)
Highly parallel processing units within the GPU that execute threads in warps.
- SMs manage the execution of threads, including scheduling, synchronization, and memory access, allowing for efficient parallel processing of tasks on the GPU.
Each SM keeps multiple warps resident at once and, on GPUs with ray tracing hardware, includes 1 Ray Tracing Unit (RTU, which NVIDIA calls an RT Core) for ray tracing workloads.
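The SM count of an installed GPU can be read at run time. The sketch below uses `cudaGetDeviceProperties` from the CUDA runtime. The `SmInfo` struct and the `querySmInfo` helper are hypothetical names introduced here. As far as the runtime's device properties go, the number of GPCs and the number of CUDA cores per SM do not appear to be reported directly, so they are left out.

```cuda
#include <cuda_runtime.h>

// Hypothetical summary of the SM-level facts the runtime reports.
struct SmInfo {
    int smCount;          // number of SMs on the device
    int warpSize;         // threads per warp
    int maxThreadsPerSm;  // resident threads one SM can hold
    int maxWarpsPerSm;    // derived: maxThreadsPerSm / warpSize
};

// Fills `info` for the given device index. Returns false on error.
bool querySmInfo(int device, SmInfo &info) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    info.smCount = prop.multiProcessorCount;
    info.warpSize = prop.warpSize;
    info.maxThreadsPerSm = prop.maxThreadsPerMultiProcessor;
    info.maxWarpsPerSm = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return true;
}
```

Multiplying `smCount` by `maxThreadsPerSm` gives a rough upper bound on how many threads the GPU can keep resident at once, which is a useful starting point when sizing a grid.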
2.1 🔆 Ray Tracing Units (RTUs)
Specialized hardware for accelerating ray tracing workloads, which are common in graphics rendering and increasingly used in AI applications for tasks like 3D modeling and simulation.
- RTUs can perform ray tracing calculations much faster than traditional CUDA cores, enabling real-time ray tracing in graphics applications and accelerating ray tracing-based algorithms in AI workloads.
- RTUs are designed to handle the core calculations of ray tracing, such as bounding volume hierarchy (BVH) traversal and ray-triangle intersection tests, which are computationally intensive when performed on general-purpose CUDA cores.
2.2 🧶 Warps
Group of 32 threads that an SM schedules together and runs on its CUDA Cores and Tensor Cores.
- Threads in a warp execute the same instruction simultaneously: Single Instruction, Multiple Threads (SIMT), NVIDIA's variant of Single Instruction, Multiple Data (SIMD)
- Divergence (different instructions) causes serialization and performance loss
- Design algorithms to minimize divergence for optimal performance
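To make divergence concrete, the sketch below contrasts two hypothetical kernels. In `branchPerThread`, neighbouring threads of the same warp take different branches, so the warp may have to run both paths one after the other. In `branchPerWarp`, the condition is the same for all 32 threads of a warp, so each warp follows a single path. The kernel names, the helper `runBranchKernel`, and the block size of 64 are assumptions for illustration. For branches this small the compiler may replace the branch with predicated instructions, so the cost only becomes noticeable with heavier branch bodies.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Divergent: even and odd threads of one warp take different branches.
__global__ void branchPerThread(float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    if (threadIdx.x % 2 == 0) out[idx] = idx * 2.0f;
    else                      out[idx] = idx + 1.0f;
}

// Uniform: the branch depends on the warp index, so all 32 threads
// of a warp agree on which path to take.
__global__ void branchPerWarp(float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int warpInBlock = threadIdx.x / warpSize;
    if (warpInBlock % 2 == 0) out[idx] = idx * 2.0f;
    else                      out[idx] = idx + 1.0f;
}

// Hypothetical host helper: runs one of the two kernels over n elements.
// Returns an empty vector if a CUDA runtime call fails.
std::vector<float> runBranchKernel(bool perWarp, int n) {
    if (n <= 0) return {};
    std::vector<float> host(n, 0.0f);
    const size_t bytes = host.size() * sizeof(float);

    float *device = nullptr;
    if (cudaMalloc(&device, bytes) != cudaSuccess) return {};

    const int threadsPerBlock = 64;  // two warps per block (assumed)
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (perWarp) branchPerWarp<<<blocks, threadsPerBlock>>>(device, n);
    else         branchPerThread<<<blocks, threadsPerBlock>>>(device, n);

    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(host.data(), device, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(device);
    if (!ok) host.clear();
    return host;
}
```

The two kernels assign work differently, so their outputs differ; the point of the comparison is the branching pattern, which a profiler would show as divergent branches in the first kernel only.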
2.2.1 🪡 CUDA Cores
General-purpose cores for executing a wide range of parallel tasks, including graphics rendering and general computing workloads.
Perform standard floating-point and integer operations, making them suitable for a variety of applications beyond graphics, such as scientific computing, machine learning, and data processing.
- Addition, multiplication
- Bitwise operations
2.2.2 🧵 Tensor Cores
Specialized cores designed for accelerating matrix operations, particularly in deep learning workloads.
- Perform mixed-precision matrix multiply and accumulate operations, which are common in deep learning algorithms:
$$D = A \times B + C$$
Where A, B, C, and D are matrices. Tensor Cores can perform this operation much faster than traditional CUDA cores, especially when using lower precision formats like FP16 or INT8, which are often sufficient for deep learning tasks.
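The multiply-accumulate D = A × B + C can be issued from CUDA C++ through the warp-level `nvcuda::wmma` API. The sketch below computes a single 16 x 16 tile with FP16 inputs and FP32 accumulation. It assumes a GPU with Tensor Cores (compute capability 7.0 or newer) and compilation for a matching architecture, for example `nvcc -arch=sm_70`. The names `tileMma` and `tensorCoreMma` are hypothetical, and B is assumed to be stored column-major.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

constexpr int TILE = 16;  // one 16 x 16 x 16 WMMA tile

// Hypothetical kernel: one warp computes D = A * B + C for a single tile.
// A is row-major FP16, B is column-major FP16, C and D are row-major FP32.
__global__ void tileMma(const half *a, const half *b,
                        const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> accFrag;

    wmma::load_matrix_sync(aFrag, a, TILE);
    wmma::load_matrix_sync(bFrag, b, TILE);
    wmma::load_matrix_sync(accFrag, c, TILE, wmma::mem_row_major);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);  // acc = A * B + acc
    wmma::store_matrix_sync(d, accFrag, TILE, wmma::mem_row_major);
}

// Hypothetical host helper: each input holds 256 values.
// Returns D, or an empty vector on bad input or a CUDA error.
std::vector<float> tensorCoreMma(const std::vector<float> &a,
                                 const std::vector<float> &b,
                                 const std::vector<float> &c) {
    const size_t count = TILE * TILE;
    if (a.size() != count || b.size() != count || c.size() != count) return {};

    // Convert the inputs to FP16 on the host.
    std::vector<half> aHalf(count), bHalf(count);
    for (size_t i = 0; i < count; ++i) {
        aHalf[i] = __float2half(a[i]);
        bHalf[i] = __float2half(b[i]);
    }

    half *dA = nullptr, *dB = nullptr;
    float *dC = nullptr, *dD = nullptr;
    std::vector<float> d(count);
    bool ok =
        cudaMalloc(&dA, count * sizeof(half)) == cudaSuccess &&
        cudaMalloc(&dB, count * sizeof(half)) == cudaSuccess &&
        cudaMalloc(&dC, count * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&dD, count * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dA, aHalf.data(), count * sizeof(half),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dB, bHalf.data(), count * sizeof(half),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dC, c.data(), count * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        tileMma<<<1, 32>>>(dA, dB, dC, dD);  // exactly one warp
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(d.data(), dD, count * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dD);
    if (!ok) d.clear();
    return d;
}
```

With A and B filled with ones and C filled with twos, every element of D is 16 × 1 × 1 + 2 = 18. In practice, libraries such as cuDNN and cuBLAS select Tensor Core paths automatically, so hand-written WMMA code is mostly useful for understanding what those libraries do.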
GPCs 🧮
 |--> 🥓 SMs
       |--> 🔆 Ray Tracing Units (RTUs)
       |--> 🧶 Warps
             |--> 🪡 CUDA Cores
             |--> 🧵 Tensor Cores
cuDNN (CUDA Deep Neural Network library)
GPU-accelerated library for deep learning primitives.
Provides highly tuned implementations for standard routines such as:
- forward and backward convolution
- attention
- matmul
- pooling
- normalization
