Storage
AI workloads require:
- Extremely high throughput
- Parallel access from many GPUs
- Low latency during training
- Scalable capacity for datasets and checkpoints
Bottlenecks in AI Storage
Key Principle: GPUs must never sit idle waiting for data.
Common issues:
- Insufficient I/O throughput
- Network congestion
- Poor file system scaling
- CPU bottlenecks during data movement
Impact: GPU underutilization.
Tiered Storage Architecture
AI data centers use a hybrid, tiered storage model.
1. 🔥 Hot Tier (Fastest)
1.1. NVMe SSD (Local Storage)
- Directly attached to server
- Very high IOPS and throughput
- Used for:
- Active model training
- Temporary datasets
- Checkpoints
Limited capacity but extremely fast.
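For illustration, a minimal sketch of measuring sequential read throughput from a locally attached NVMe path (the file path is hypothetical):

```python
import time

# Hypothetical file on a locally attached NVMe SSD
DATA_FILE = "/mnt/nvme/dataset.bin"
CHUNK = 16 * 1024 * 1024  # 16 MiB sequential reads

start = time.perf_counter()
total = 0
with open(DATA_FILE, "rb") as f:
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start

print(f"Read {total / 1e9:.1f} GB at {total / 1e9 / elapsed:.2f} GB/s")
```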
1.2. Network File Systems (Shared Storage)
- Accessible by multiple nodes
- Moderate latency, but the most common option for shared access
- Data stored as blocks or files
- Used for:
- Shared datasets
- Model checkpoints
1.3. Parallel & Distributed File Systems (Shared High-Speed)
- Shared across cluster
- High bandwidth
- Scales horizontally
- Supports many GPUs simultaneously
Used for:
- Distributed training
- Shared datasets
- Large-scale HPC workloads
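A minimal sketch of the usual access pattern on a parallel file system: every node sees the same namespace, and each training rank reads only its own shard from a shared mount (the mount point and shard layout are hypothetical; RANK and WORLD_SIZE would normally be set by a launcher such as torchrun):

```python
import glob
import os

# Hypothetical shared mount point exported by a parallel file system
SHARED_MOUNT = "/mnt/parallel_fs/dataset_shards"

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# All nodes list the same files; each rank takes a disjoint slice
shards = sorted(glob.glob(os.path.join(SHARED_MOUNT, "shard-*.tar")))
my_shards = shards[rank::world_size]

for path in my_shards:
    with open(path, "rb") as f:
        data = f.read()  # large sequential read, high bandwidth per shard
        # ...decode and feed into the data pipeline...
```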
2. ❄ Cold Tier (Long-Term Storage)
2.1. Object Storage
- Massive scalability
- Lower cost per TB
- Higher latency
Used for:
- Raw datasets
- Archived models
- Logs
- Historical checkpoints
Examples:
- S3-compatible systems
- Cloud object storage
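A minimal sketch of archiving and retrieving artifacts from an S3-compatible object store with boto3 (the endpoint, bucket, and object keys are hypothetical):

```python
import boto3

# Hypothetical S3-compatible endpoint and bucket for the cold tier
s3 = boto3.client("s3", endpoint_url="https://objects.example.internal")

# Archive a trained model and its logs
s3.upload_file("checkpoints/model_final.pt", "ml-archive", "runs/2024-07/model_final.pt")
s3.upload_file("logs/train.log", "ml-archive", "runs/2024-07/train.log")

# Later: pull an archived checkpoint back to fast local storage
s3.download_file("ml-archive", "runs/2024-07/model_final.pt", "/mnt/nvme/model_final.pt")
```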
Data Locality
Better performance when:
- Data is closer to compute
- Fewer network hops required
Hierarchy: Local NVMe > Parallel FS > Object Storage
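A minimal sketch of staging data down this hierarchy before training, copying from a shared tier to local NVMe so the hot loop reads locally (both paths are hypothetical):

```python
import shutil
from pathlib import Path

# Hypothetical locations: shared parallel FS (remote) and local NVMe (fast)
SHARED = Path("/mnt/parallel_fs/datasets/cifar100")
LOCAL = Path("/mnt/nvme/cache/cifar100")

# Stage once per node; training then reads only from the local copy
if not LOCAL.exists():
    shutil.copytree(SHARED, LOCAL)

print(f"Training reads from {LOCAL} (local NVMe) instead of {SHARED}")
```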
Storage Access Patterns in AI
During Training
- Large sequential reads
- Multi-node concurrent access
- Frequent checkpoint writes
Requires:
- High throughput
- Parallel file systems
- RDMA support
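A minimal PyTorch sketch of the training-side pattern: many parallel workers issuing reads, with pinned host memory and prefetching so the GPU stays fed (the dataset is a stand-in for real records read from storage):

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ShardDataset(Dataset):
    """Toy dataset standing in for large records read from storage."""

    def __init__(self, num_items: int):
        self.num_items = num_items

    def __len__(self):
        return self.num_items

    def __getitem__(self, idx):
        # In practice this would be a large sequential read from disk
        return torch.randn(3, 224, 224), idx % 1000


loader = DataLoader(
    ShardDataset(10_000),
    batch_size=256,
    num_workers=8,       # parallel read streams hide storage latency
    pin_memory=True,     # faster host-to-GPU copies
    prefetch_factor=4,   # stay ahead of the GPU's consumption rate
)

for images, labels in loader:
    pass  # forward/backward pass would go here
```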
During Inference
- Smaller model loads
- Lower bandwidth needs
- Latency-sensitive
Often served via:
- NVMe
- Optimized storage pipelines
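A minimal sketch of the inference-side pattern: a one-time, latency-sensitive weight load from local NVMe onto the GPU (the path is hypothetical; torchvision is assumed installed and ResNet-50 is just a placeholder model):

```python
import torch
import torchvision

# Hypothetical weights file kept on local NVMe for fast, low-latency loads
WEIGHTS = "/mnt/nvme/models/resnet50_finetuned.pt"

model = torchvision.models.resnet50()
state_dict = torch.load(WEIGHTS, map_location="cuda")  # small one-off read, not bulk throughput
model.load_state_dict(state_dict)
model.to("cuda").eval()
```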
RDMA & Storage
Traditional storage path: Storage → CPU → System Memory → GPU
With acceleration: Storage → GPU Memory (Direct)
GPUDirect Storage
- Bypasses CPU and system memory
- Direct path from NVMe or parallel storage to GPU
- Reduces bottlenecks
- Improves training speed
Best for:
- Large dataset ingestion
- High-performance training clusters
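A minimal sketch of a direct storage-to-GPU read using the RAPIDS KvikIO library, which wraps cuFile/GPUDirect Storage (assumes KvikIO and CuPy are installed and a GDS-capable path; the file path is hypothetical, and KvikIO falls back to a bounce-buffer copy when GDS is unavailable):

```python
import cupy
import kvikio

# Hypothetical tensor file on an NVMe or parallel-FS path with GDS support
PATH = "/mnt/nvme/embeddings.bin"

# Allocate the destination buffer directly in GPU memory
buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)

# Read straight from storage into GPU memory, bypassing CPU system memory
f = kvikio.CuFile(PATH, "r")
nbytes = f.read(buf)
f.close()

print(f"Read {nbytes / 1e6:.0f} MB directly into GPU memory")
```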
NVMe over Fabrics (NVMe-oF)
- Extends NVMe across network
- Enables remote high-speed storage access
- Often combined with RDMA
- Used in HPC and AI clusters
Storage Networking Considerations
Storage traffic must be:
- Isolated from compute fabric
- High bandwidth
- Low contention
- Predictable
AI clusters often separate:
- Compute traffic
- Storage traffic
- Management traffic
Storage Scalability
AI datasets grow rapidly.
Storage must:
- Scale capacity easily
- Maintain performance at scale
- Support multi-node access
Parallel file systems scale horizontally by:
- Adding storage nodes
- Distributing metadata
Storage and Checkpointing
During training:
- Models save checkpoints periodically
- Checkpoints can be large (GBs–TBs)
Storage must handle:
- Frequent writes
- Multi-GPU simultaneous checkpoints
- Recovery after failure
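A minimal PyTorch sketch of periodic checkpointing to a shared path, using the common pattern where only rank 0 writes the consolidated state (sharded multi-writer checkpoints are also used in practice; the directory and interval are hypothetical):

```python
import torch

CHECKPOINT_DIR = "/mnt/parallel_fs/checkpoints"  # hypothetical shared location
SAVE_EVERY = 1000                                # steps between checkpoints


def maybe_save_checkpoint(step, model, optimizer, rank):
    """Periodically persist training state so a failed job can resume."""
    if step % SAVE_EVERY != 0 or rank != 0:
        return
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        f"{CHECKPOINT_DIR}/ckpt_step{step}.pt",
    )


def resume_from(path, model, optimizer):
    """Recovery after failure: reload the last checkpoint and continue."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```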
RAID & Data Protection
RAID used for:
- Redundancy
- Performance improvement
- Fault tolerance
In large AI systems:
- Erasure coding often used
- Object storage provides durability
Storage in Cloud vs On-Prem
Cloud
- Object storage dominant
- Elastic scaling
- Pay-as-you-go
On-Prem
- Full control
- Parallel file systems common
- Lower long-term cost at scale
Exam Scenarios to Recognize
If question mentions:
- GPUs starving for data → Storage bottleneck
- Massive shared dataset across nodes → Parallel file system
- Long-term archive → Object storage
- Direct storage-to-GPU transfer → GPUDirect Storage
- Ultra-fast local I/O → NVMe SSD
Quick Memory Anchors
- NVMe = Fastest local storage
- Parallel FS = Shared high-speed cluster storage
- Object storage = Massive, cheap, long-term
- GPUDirect Storage = Bypass CPU
- Training = High throughput demand
- Inference = Lower bandwidth, latency focus
