AI Programming Model
Overview of NVIDIA's AI programming model, including core libraries (CUDA, NCCL, cuDNN), training vs inference workloads, and compute scaling models (data parallelism and model parallelism) for AI infrastructure.
Core Libraries & Frameworks
1. CUDA (Compute Unified Device Architecture)
Parallel computing platform enabling GPU programming.
- Thousands of parallel threads
- Native C/C++/Python integration
- General-purpose GPU computing
CUDA parallel model:
- Break problem into small identical tasks
- Launch thousands of threads (workers) to execute them simultaneously
- Collect results when all threads finish
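The launch-and-collect pattern above can be sketched with a CPU analogy. This is not CUDA code; it is a minimal pure-Python illustration using a thread pool, where each task plays the role of one GPU thread computing one element of a SAXPY (`a*x + y`) operation. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def saxpy(i, a, x, y):
    """One 'thread' of work: compute a single element of a*x + y."""
    return a * x[i] + y[i]

def launch_kernel(a, x, y, max_workers=8):
    """Mimic a kernel launch: one small identical task per element,
    run concurrently, results collected when all workers finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves order, like each GPU thread writing out[i]
        return list(pool.map(lambda i: saxpy(i, a, x, y), range(len(x))))

result = launch_kernel(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # → [12.0, 24.0, 36.0]
```

On a GPU, the same decomposition maps each element to one of thousands of hardware threads instead of a handful of OS threads.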
2. NCCL (NVIDIA Collective Communications Library)
NCCL (pronounced "Nickel") implements both collective communication and point-to-point send/receive primitives.
- Used by PyTorch & TensorFlow
- Not a full parallel programming framework; it is a library focused on accelerating inter-GPU communication
Provides the following collective communication primitives:
- Reduce
- Gather
- Scatter
- ReduceScatter
- AllReduce
- AllGather
- AlltoAll
- Broadcast
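To make the primitives above concrete, here is a toy single-process sketch of what three of them compute. Each inner list stands for one GPU's buffer; real NCCL performs these exchanges over NVLink, PCIe, or the network, while this only shows the resulting data movement. All function names are hypothetical.

```python
def all_reduce(bufs):
    """AllReduce: every rank ends up with the element-wise sum of all buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]

def reduce_scatter(bufs):
    """ReduceScatter: sum like AllReduce, but each rank keeps only its slice."""
    total = [sum(vals) for vals in zip(*bufs)]
    chunk = len(total) // len(bufs)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(bufs))]

def all_gather(chunks):
    """AllGather: every rank receives the concatenation of all ranks' chunks."""
    gathered = [v for c in chunks for v in c]
    return [list(gathered) for _ in chunks]

gpus = [[1, 2, 3, 4], [10, 20, 30, 40]]   # 2 "GPUs", 4 elements each
print(all_reduce(gpus))       # → [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(gpus))   # → [[11, 22], [33, 44]]
```

Note that `all_gather(reduce_scatter(gpus))` equals `all_reduce(gpus)`: AllReduce is commonly implemented as ReduceScatter followed by AllGather.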
3. cuDNN (CUDA Deep Neural Network library)
GPU-accelerated library for deep learning primitives.
Provides highly tuned implementations for standard routines such as:
- forward and backward convolution
- attention
- matmul
- pooling
- normalization
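To illustrate the kind of routine cuDNN accelerates, here is a naive pure-Python 2×2 max pooling with stride 2. This is a reference sketch of the math only; cuDNN ships heavily tuned GPU kernels for the same operation. The function name is hypothetical.

```python
def max_pool_2x2(x):
    """Naive 2x2 max pooling, stride 2, on a 2-D list of numbers."""
    h, w = len(x), len(x[0])
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 2],
]
print(max_pool_2x2(feature_map))  # → [[6, 4], [7, 9]]
```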
Training vs Inference
AI Workflow:
Data Preparation
|--> Model Training
|--> Optimization
|--> Inference/Deployment
Model Training
Compute-intensive
- Forward + backward pass
- Multi-GPU scaling
- High memory + compute demand
- Uses NCCL, NVLink, RDMA
Model Inference
Latency-optimized
- Forward pass only
- Lower latency focus
- Often containerized (Kubernetes)
| Training | Inference |
|---|---|
| Model learning | Model usage |
| High compute + memory | Lower latency focus |
| Batch workloads | Real-time workloads |
| Multi-GPU scaling | Edge + cloud deployment |
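The forward+backward vs. forward-only distinction can be shown on a one-parameter linear model. This is a minimal sketch, not framework code: the training step runs a forward pass, computes the gradient of a squared-error loss, and updates the weight; inference runs the forward pass only.

```python
def forward(w, x):
    """Forward pass: the model's prediction."""
    return w * x

def train_step(w, x, target, lr=0.1):
    """Training: forward pass + backward pass (gradient) + weight update."""
    pred = forward(w, x)              # forward pass
    grad = 2 * (pred - target) * x    # d/dw of (pred - target)^2
    return w - lr * grad              # gradient descent update

w = 0.0
for _ in range(50):                   # training loop: learn w such that w*1 ≈ 3
    w = train_step(w, x=1.0, target=3.0)

# Inference/deployment: forward pass only, no gradients, no updates
print(round(forward(w, 1.0), 3))      # → 3.0
```

This is why inference is cheaper per request: it skips the backward pass and all optimizer state, which is also what makes it easier to containerize and deploy at low latency.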
Compute Scaling Models
1. Data Parallelism
- Same model on multiple GPUs
- Split dataset across GPUs
2. Model Parallelism
- Model split across GPUs
- Used for very large models
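Data parallelism can be sketched end to end in a few lines. Assuming a simple MSE-trained linear model (all names hypothetical): the same model weight lives on every "GPU", the dataset is split into shards, each shard produces a local gradient, and the gradients are averaged before the shared update, which is the step real systems implement with NCCL AllReduce.

```python
def local_gradient(w, shard):
    """Each 'GPU' computes the MSE gradient on its own data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, dataset, num_gpus=2, lr=0.05):
    """Same model on every GPU; dataset split across GPUs; gradients averaged."""
    shard_size = len(dataset) // num_gpus
    shards = [dataset[i * shard_size:(i + 1) * shard_size]
              for i in range(num_gpus)]
    grads = [local_gradient(w, s) for s in shards]  # one per GPU, in parallel
    avg_grad = sum(grads) / len(grads)              # the AllReduce step
    return w - lr * avg_grad                        # identical update everywhere

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, data)
print(round(w, 2))  # → 2.0
```

Model parallelism would instead place different layers (or slices of layers) of one model on different GPUs; it is needed when the model itself no longer fits in a single GPU's memory.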
