AI Programming Model
Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.
CUDA (Compute Unified Device Architecture)
Parallel computing platform enabling GPU programming.
- Thousands of parallel threads
- C/C++ language extensions, with Python access through libraries such as CuPy and Numba
- General-purpose GPU computing
CUDA parallel model:
- Break problem into small identical tasks
- Launch thousands of threads (workers) to do them simultaneously
- Collect results when every thread finishes
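The three steps above can be sketched on the CPU. This is only an emulation in plain Python, not CUDA code: the names `vector_add_kernel` and `launch` are hypothetical, and the loop runs the "threads" one after another where a GPU would run them in parallel. What it does mirror is the index calculation a CUDA C kernel typically uses (`blockIdx.x * blockDim.x + threadIdx.x`) and the bounds check needed when the thread count exceeds the data size.

```python
# CPU-side emulation of a CUDA-style kernel launch.
# All names here are illustrative; none of this is a CUDA API.

def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Work done by one 'thread': add a single pair of elements."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):  # guard: the last block may have spare threads
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per (block, thread) pair.

    A GPU would execute these invocations in parallel; this loop runs
    them sequentially, which gives the same result for independent tasks.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


n = 10
a = list(range(n))
b = [10 * x for x in range(n)]
out = [0] * n

block_dim = 4                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # ceiling division: enough blocks to cover n

launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

Here 3 blocks of 4 threads cover 10 elements, so the last two threads fail the bounds check and do nothing.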
cuDNN (CUDA Deep Neural Network library)
GPU-accelerated library for deep learning primitives.
Provides highly tuned implementations for standard routines such as:
- forward and backward convolution
- attention
- matmul
- pooling
- normalization
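To make two of these primitives concrete, here is a minimal pure-Python sketch of what a forward convolution and a pooling step compute. It is a reference for the arithmetic only: it does not use cuDNN, does not reflect how cuDNN implements these routines, and the function names `conv1d_forward` and `max_pool1d` are hypothetical.

```python
# Reference arithmetic for two deep learning primitives (1-D, no padding, stride 1).
# Illustrative only: this is not the cuDNN API.

def conv1d_forward(x, w):
    """'Valid' cross-correlation, the operation deep learning calls convolution."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def max_pool1d(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]          # simple difference filter

y = conv1d_forward(x, w)      # each output is x[i] - x[i + 2]
p = max_pool1d(x, 2)          # windows [1, 2] and [3, 4]; the 5 is dropped

print(y, p)
```

Each output element is independent of the others, which is why these routines map well onto the parallel thread model described earlier.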
Training vs Inference
AI Workflow:
Data Preparation
|--> Model Training
|--> Optimization
|--> Inference/Deployment
Model Training
Compute-intensive
- Forward + backward pass
- Multi-GPU scaling
- High memory + compute demand
- Uses NCCL, NVLink, RDMA
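In data-parallel multi-GPU training, the collective most often associated with NCCL is all-reduce: each worker contributes its local gradients and every worker receives the combined result. The sketch below emulates that in plain Python, assuming three workers with made-up gradient values. `all_reduce_sum` is a hypothetical helper, not an NCCL call, and no GPU communication takes place.

```python
# Emulation of an all-reduce over gradients from several workers.
# Illustrative only: a real system would perform this across GPUs.

def all_reduce_sum(per_worker):
    """Give every worker the element-wise sum of all workers' vectors."""
    total = [sum(vals) for vals in zip(*per_worker)]
    return [list(total) for _ in per_worker]


# Hypothetical gradients, one list per worker, each computed on a different data shard.
grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]

summed = all_reduce_sum(grads)
world_size = len(grads)

# Dividing the sum by the worker count gives the average gradient,
# so every worker applies the same update and the model copies stay in sync.
averaged = [[g / world_size for g in worker] for worker in summed]

print(averaged)
```

The volume of data exchanged in this step grows with model size, which is one reason training clusters rely on fast interconnects such as NVLink and RDMA-capable networks.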
Model Inference
Latency-optimized
- Forward pass only
- Lower latency focus
- Often containerized (Kubernetes)
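The difference between "forward + backward pass" and "forward pass only" can be shown with a toy model. The sketch below assumes a one-parameter model `y = w * x` with a squared-error loss; the data point, learning rate, and step count are arbitrary illustrative choices, and `train_step` is a hypothetical helper rather than any framework's API.

```python
# Toy contrast between a training step and an inference call.

def forward(w, x):
    """Forward pass: compute the model's prediction."""
    return w * x


def train_step(w, x, y_true, lr):
    """One training step: forward pass, backward pass, parameter update."""
    y_pred = forward(w, x)                  # forward pass
    grad_w = 2.0 * (y_pred - y_true) * x    # backward pass: d/dw of (y_pred - y_true)**2
    return w - lr * grad_w                  # gradient descent update


# Training: repeat forward + backward on one example (x=2, y=6, so the ideal w is 3).
w = 0.0
for _ in range(50):
    w = train_step(w, x=2.0, y_true=6.0, lr=0.05)

# Inference: a single forward pass with the learned weight, no gradients needed.
prediction = forward(w, 4.0)

print(w, prediction)
```

Training calls `forward` fifty times and computes a gradient each time, while inference calls it once, which is the source of the compute gap between the two workloads.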
| Training | Inference |
|---|---|
| Model learning | Model usage |
| High compute + memory | Lower latency focus |
| Batch workloads | Real-time workloads |
| Multi-GPU scaling | Edge + cloud deployment |
