Core Libraries & Frameworks
1. CUDA (Compute Unified Device Architecture)
NVIDIA's parallel computing platform and programming model for GPUs.
- Thousands of parallel threads
- Native C/C++/Python integration
- General-purpose GPU computing
CUDA parallel model (sketched below):
- Break the problem into many small, identical tasks
- Launch thousands of threads (workers) to run them simultaneously
- Collect the results when all threads finish
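A minimal Python sketch of this model, assuming the Numba library's CUDA JIT; the kernel name and array sizes are illustrative:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_vectors(a, b, out):
    i = cuda.grid(1)          # this thread's global index
    if i < out.size:          # guard threads past the array end
        out[i] = a[i] + b[i]  # each thread handles one element

n = 1_000_000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Launch ~1M threads; Numba copies the host arrays to the GPU
# before the launch and collects the results back afterwards.
add_vectors[blocks, threads_per_block](a, b, out)
```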
2. NCCL (NVIDIA Collective Communications Library)
Library for collective communication across multiple GPUs and nodes.
- Used by PyTorch & TensorFlow
- Optimizes (sketched after this list):
- All-reduce
- Broadcast
- Synchronization across GPUs
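A sketch of these collectives through PyTorch's torch.distributed with the NCCL backend; assumes a torchrun launch that sets RANK/WORLD_SIZE:

```python
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL handles the GPU collectives
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each GPU holds its own tensor; all-reduce sums them in place
    # so every rank ends up with the same result.
    t = torch.ones(4, device="cuda") * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    # Broadcast: copy rank 0's tensor to every other rank.
    dist.broadcast(t, src=0)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # run with: torchrun --nproc_per_node=<num_gpus> script.py
```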
Training vs Inference
AI Workflow:
Data Preparation --> Model Training --> Optimization --> Inference/Deployment
Model Training
Compute-intensive (one step sketched below)
- Forward + backward pass
- Multi-GPU scaling
- High memory + compute demand
- Uses NCCL, NVLink, RDMA
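A minimal sketch of one training step (forward pass, loss, backward pass, update); the model and batch shapes are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 128, device="cuda")         # one batch of inputs
y = torch.randint(0, 10, (64,), device="cuda")  # target labels

logits = model(x)          # forward pass
loss = loss_fn(logits, y)  # measure the error
loss.backward()            # backward pass: compute gradients
optimizer.step()           # update the weights
optimizer.zero_grad()      # clear gradients for the next batch
```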
Model Inference
Latency-optimized (sketched below)
- Forward pass only
- Lower latency focus
- Often containerized (Kubernetes)
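The corresponding inference sketch: forward pass only, with gradient tracking disabled to cut latency and memory (model is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()
model.eval()  # disable training-only behavior (dropout, batch-norm updates)

x = torch.randn(1, 128, device="cuda")  # a single real-time request
with torch.no_grad():                   # skip gradient bookkeeping entirely
    logits = model(x)                   # forward pass only
    prediction = logits.argmax(dim=-1)
```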
| Training | Inference |
|---|---|
| Model learning | Model usage |
| High compute + memory | Lower latency focus |
| Batch workloads | Real-time workloads |
| Multi-GPU scaling | Edge + cloud deployment |
Compute Scaling Models
1. Data Parallelism
- Same model replicated on every GPU
- Dataset split into shards, one per GPU (see the sketch below)
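A sketch of data parallelism with PyTorch's DistributedDataParallel over the NCCL backend; assumes a torchrun launch, and the shapes are illustrative:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Identical model replica on every GPU.
model = DDP(nn.Linear(128, 10).cuda(), device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Each rank trains on its own shard of the dataset.
x = torch.randn(64, 128, device="cuda")  # this rank's shard (illustrative)
y = torch.randint(0, 10, (64,), device="cuda")
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # gradients are all-reduced across GPUs here via NCCL
optimizer.step()  # every replica applies the same averaged update

dist.destroy_process_group()
```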
2. Model Parallelism
- Model layers split across GPUs
- Used when a model is too large for a single GPU's memory (see the sketch below)
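A sketch of simple model parallelism: layers placed on two GPUs, with activations moved between them (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Neither GPU holds the full model.
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")  # first half on GPU 0
        self.part2 = nn.Linear(4096, 10).to("cuda:1")    # second half on GPU 1

    def forward(self, x):
        h = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(h.to("cuda:1"))  # activations cross GPUs

model = SplitModel()
out = model(torch.randn(8, 1024))
```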
