AI Infrastructure

AI infrastructure fundamentals covering GPU hardware selection, cluster scaling, power and cooling design, networking, high-speed interconnects, and DPU integration for modern data centers.

Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 19 2026


Storage

AI workloads require:

  • Extremely high throughput
  • Parallel access from many GPUs
  • Low latency during training
  • Scalable capacity for datasets and checkpoints

Bottlenecks in AI Storage

Key Principle: GPUs must never sit idle waiting for data.

Common issues:

  • Insufficient I/O throughput
  • Network congestion
  • Poor file system scaling
  • CPU bottlenecks during data movement

Impact: GPU underutilization.
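
To keep GPUs from starving, data loading is typically overlapped with compute using background workers and prefetching. A minimal PyTorch-style sketch (the dataset, batch size, and worker counts below are illustrative assumptions, not values from this post):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class LocalShardDataset(Dataset):
    """Hypothetical dataset; in practice __getitem__ would read samples from fast local storage."""
    def __init__(self, num_samples: int = 4096):
        self.num_samples = num_samples

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Stand-in for reading and decoding one sample from NVMe.
        return torch.randn(3, 224, 224), idx % 1000

loader = DataLoader(
    LocalShardDataset(),
    batch_size=256,
    num_workers=8,            # worker processes keep reading while the GPU computes
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,        # each worker keeps several batches queued ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for images, labels in loader:
    pass  # training step runs here; the next batch should already be staged
```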


Tiered Storage Architecture

AI data centers use a hybrid storage model.

1. 🔥 Hot Tier (Fastest)

1.1 NVMe SSD (Local Storage)

  • Directly attached to server
  • Very high IOPS and throughput
  • Used for:
    • Active model training
    • Temporary datasets
    • Checkpoints

Limited capacity but extremely fast.
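
A rough way to see why local NVMe anchors the hot tier is to time a large sequential read. A minimal sketch (the file path and chunk size are placeholders):

```python
import time

PATH = "/mnt/nvme/dataset.bin"      # hypothetical file on a local NVMe mount
CHUNK = 16 * 1024 * 1024            # 16 MiB reads approximate large sequential I/O

total = 0
start = time.perf_counter()
with open(PATH, "rb", buffering=0) as f:   # unbuffered so the page cache skews the number less
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start

print(f"Read {total / 1e9:.2f} GB in {elapsed:.2f} s ({total / 1e9 / elapsed:.2f} GB/s)")
```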

1.2 Network File Systems (Shared Storage)

  • Accessible by multiple nodes
  • Moderate latency but most common for shared access
  • Stored as Blocks or Files
  • Used for:
    • Shared datasets
    • Model checkpoints

1.3 Parallel & Distributed File Systems (Shared High-Speed)

  • Shared across cluster
  • High bandwidth
  • Scales horizontally
  • Supports many GPUs simultaneously

Used for:

  • Distributed training
  • Shared datasets
  • Large-scale HPC workloads
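
From the application's side a parallel file system usually appears as a single shared mount; every training process sees the same namespace and reads its own slice of the data. A minimal sketch of rank-based sharding (the mount point, file layout, and environment variables are assumptions):

```python
import os
from pathlib import Path

# Hypothetical shared mount exported by a parallel file system; every node sees the same paths.
DATA_DIR = Path("/mnt/parallel-fs/datasets/webtext-shards")

rank = int(os.environ.get("RANK", 0))              # typically set by the distributed launcher
world_size = int(os.environ.get("WORLD_SIZE", 1))

# Each rank takes every world_size-th shard, so all GPUs read concurrently
# without contending for the same files.
all_shards = sorted(DATA_DIR.glob("shard-*.tar"))
my_shards = all_shards[rank::world_size]

print(f"rank {rank}/{world_size} will read {len(my_shards)} shards")
```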

2. ❄ Cold Tier (Long-Term Storage)

2.1 Object Storage

  • Massive scalability
  • Lower cost per TB
  • Higher latency

Used for:

  • Raw datasets
  • Archived models
  • Logs
  • Historical checkpoints

Examples:

  • S3-compatible systems
  • Cloud object storage

Data Locality

Better performance when:

  • Data is closer to compute
  • Fewer network hops required

Hierarchy: Local NVMe > Parallel FS > Object Storage
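
A common pattern that follows from this hierarchy is to stage the working set out of object storage onto local NVMe once, before training starts. A hedged sketch using boto3 against an S3-compatible endpoint (the endpoint, bucket, prefix, and paths are placeholders):

```python
import os
import boto3

# Hypothetical S3-compatible object store holding the raw dataset (cold tier).
s3 = boto3.client("s3", endpoint_url="https://objects.example.internal")

BUCKET = "training-data"
PREFIX = "imagenet/shards/"
LOCAL_DIR = "/mnt/nvme/staged"      # hot tier: local NVMe scratch space

os.makedirs(LOCAL_DIR, exist_ok=True)

# Download each object once; training then reads only from local NVMe,
# instead of repeatedly crossing the network to the cold tier.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.download_file(BUCKET, key, os.path.join(LOCAL_DIR, os.path.basename(key)))
```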


Storage Access Patterns in AI

During Training

  • Large sequential reads
  • Multi-node concurrent access
  • Frequent checkpoint writes

Requires:

  • High throughput
  • Parallel file systems
  • RDMA support

During Inference

  • Smaller model loads
  • Lower bandwidth needs
  • Latency-sensitive

Often served via:

  • NVMe
  • Optimized storage pipelines
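
For inference the question is usually how fast a model can be loaded and made ready to serve. One hedged option is memory-mapping weights from local NVMe with the safetensors format, so tensors are paged in on demand rather than copied through a pickle load (the path and model below are placeholders):

```python
import torch
from safetensors.torch import load_file

WEIGHTS_PATH = "/mnt/nvme/models/classifier.safetensors"   # hypothetical weights on local NVMe

# safetensors memory-maps the file, which keeps cold-start latency low.
state_dict = load_file(WEIGHTS_PATH, device="cpu")

model = torch.nn.Linear(768, 1000)          # placeholder; a real service builds its matching architecture
model.load_state_dict(state_dict, strict=False)
model.eval()
```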

RDMA & Storage

Traditional storage path: Storage → CPU → System Memory → GPU

With acceleration: Storage → GPU Memory (Direct)

GPUDirect Storage

  • Bypasses CPU and system memory
  • Direct path from NVMe or parallel storage to GPU
  • Reduces bottlenecks
  • Improves training speed

Best for:

  • Large dataset ingestion
  • High-performance training clusters
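
One way to use this path from user code is the RAPIDS KvikIO library, a Python wrapper around NVIDIA's cuFile API: with GPUDirect Storage available it reads from NVMe straight into GPU memory, and otherwise falls back to a bounce-buffer path. A hedged sketch (the file path and buffer size are assumptions):

```python
import cupy as cp
import kvikio

PATH = "/mnt/nvme/tokens.bin"                       # hypothetical dataset file on local NVMe

# Destination buffer lives in GPU memory; with GPUDirect Storage enabled,
# the read bypasses the CPU and system memory entirely.
buf = cp.empty(256 * 1024 * 1024, dtype=cp.uint8)   # 256 MiB GPU buffer

f = kvikio.CuFile(PATH, "r")
bytes_read = f.read(buf)                            # read directly into GPU memory
f.close()

print(f"read {bytes_read} bytes into GPU memory")
```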

NVMe over Fabrics (NVMe-oF)

  • Extends NVMe across the network
  • Enables remote high-speed storage access
  • Often combined with RDMA
  • Used in HPC and AI clusters

Storage Networking Considerations

Storage traffic must be:

  • Isolated from compute fabric
  • High bandwidth
  • Low contention
  • Predictable

AI clusters often separate:

  • Compute traffic
  • Storage traffic
  • Management traffic

Storage Scalability

AI datasets grow rapidly.

Storage must:

  • Scale capacity easily
  • Maintain performance at scale
  • Support multi-node access

Parallel file systems scale horizontally by:

  • Adding storage nodes
  • Distributing metadata

Storage and Checkpointing

During training:

  • Models save checkpoints periodically
  • Checkpoints can be large (GBs–TBs)

Storage must handle:

  • Frequent writes
  • Multi-GPU simultaneous checkpoints
  • Recovery after failure
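
A hedged sketch of periodic checkpointing to a shared parallel file system path, writing from rank 0 only so every GPU doesn't produce its own identical copy (the directory, interval, and calling code are placeholders):

```python
import os
import torch

CKPT_DIR = "/mnt/parallel-fs/checkpoints/run-001"   # shared path visible to every node
SAVE_EVERY = 1000                                    # training steps between checkpoints

def maybe_save_checkpoint(step, model, optimizer, rank):
    """Periodically write model + optimizer state; only rank 0 writes."""
    if rank != 0 or step % SAVE_EVERY != 0:
        return
    os.makedirs(CKPT_DIR, exist_ok=True)
    tmp_path = os.path.join(CKPT_DIR, f"step-{step}.pt.tmp")
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        tmp_path,
    )
    # Rename after the write completes, so a crash mid-save is far less likely
    # to leave a truncated checkpoint behind for recovery.
    os.replace(tmp_path, os.path.join(CKPT_DIR, f"step-{step}.pt"))

# Inside the training loop (model, optimizer, and rank come from the surrounding code):
#   maybe_save_checkpoint(step, model, optimizer, rank)
```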

RAID & Data Protection

RAID used for:

  • Redundancy
  • Performance improvement
  • Fault tolerance

In large AI systems:

  • Erasure coding is often used
  • Object storage provides durability
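
The idea underlying both RAID parity and erasure coding is that redundant blocks let lost data be rebuilt. A toy single-parity (RAID-5-style) sketch using XOR, purely for illustration; real erasure codes such as Reed-Solomon generalize this to survive multiple simultaneous failures:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data_blocks = [b"AAAA", b"BBBB", b"CCCC"]   # data striped across three drives
parity = xor_blocks(data_blocks)            # parity block stored on a fourth drive

# Simulate losing drive 1 and rebuilding its block from the survivors plus parity.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])

assert rebuilt == data_blocks[1]
print("rebuilt block:", rebuilt)
```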

Storage in Cloud vs On-Prem

Cloud

  • Object storage dominant
  • Elastic scaling
  • Pay-as-you-go

On-Prem

  • Full control
  • Parallel file systems common
  • Lower long-term cost at scale

Exam Scenarios to Recognize

If a question mentions:

  • GPUs starving for data → Storage bottleneck
  • Massive shared dataset across nodes → Parallel file system
  • Long-term archive → Object storage
  • Direct storage-to-GPU transfer → GPUDirect Storage
  • Ultra-fast local I/O → NVMe SSD

Quick Memory Anchors

  • NVMe = Fastest local storage
  • Parallel FS = Shared high-speed cluster storage
  • Object storage = Massive, cheap, long-term
  • GPUDirect Storage = Bypass CPU
  • Training = High throughput demand
  • Inference = Lower bandwidth, latency focus