Hitesh Sahu
AI/ML Operations

Comprehensive overview of AI operations including data center monitoring, GPU observability, cluster orchestration, job scheduling, and virtualization strategies for accelerated infrastructure.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Thu Feb 19 2026

Monitoring & Operating AI Infrastructure

AI clusters are:

  • GPU-dense
  • Power-hungry
  • Network-intensive
  • Storage-dependent

Goals of monitoring:

  • Maximize GPU utilization
  • Detect failures early
  • Prevent downtime
  • Optimize performance
  • Ensure thermal and power stability

Monitoring Layers

AI data centers monitor multiple layers:

1️⃣ Hardware Layer

  • GPU temperature
  • GPU utilization
  • Power draw
  • CPU usage
  • Memory usage
  • Disk I/O
  • NIC throughput

2️⃣ Network Layer

  • Latency
  • Packet loss
  • RDMA errors
  • Congestion
  • Throughput

3️⃣ Storage Layer

  • IOPS
  • Throughput
  • Latency
  • File system saturation

4️⃣ Application Layer

  • Training job status
  • Job queue depth
  • Container health
  • Pod failures (Kubernetes)

1. GPU Monitoring Tools

1.1 nvidia-smi

Checks GPU status on a single system

  • CLI tool
  • Quick GPU diagnostics
  • Per-GPU statistics
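
As a rough illustration of the per-GPU statistics nvidia-smi reports, the sketch below parses its CSV query output. The `--query-gpu` and `--format=csv,noheader,nounits` options are real nvidia-smi flags; the choice of fields, the helper names, and the `SAMPLE` text are assumptions made for this example.

```python
import subprocess

# Fields requested from `nvidia-smi --query-gpu`; this selection is illustrative.
QUERY_FIELDS = ["index", "temperature.gpu", "utilization.gpu", "power.draw"]


def _to_float(value):
    """Return a float, or None when nvidia-smi reports something like [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        row = dict(zip(QUERY_FIELDS, values))
        gpus.append({
            "index": int(row["index"]),
            "temp_c": _to_float(row["temperature.gpu"]),
            "util_pct": _to_float(row["utilization.gpu"]),
            "power_w": _to_float(row["power.draw"]),
        })
    return gpus


def query_local_gpus():
    """Run nvidia-smi on this machine (needs an NVIDIA driver installed)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Made-up sample output so the parser can be tried without a GPU.
SAMPLE = "0, 61, 97, 412.35\n1, 58, 12, 95.10\n"
```

Calling `parse_gpu_csv(SAMPLE)` yields two records; on a GPU host, `query_local_gpus()` would return the same shape from live data.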

1.2 DCGM (Data Center GPU Manager)

  • Suited to monitoring 10+ GPU nodes
  • Cluster-level GPU management on Kubernetes
  • Historical data collection for health checks
  • Supports alerting
  • Cluster-wide GPU insights

Keep an eye on:

  • Temperature
  • Power consumption
  • ECC errors
  • Utilization metrics
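
One way to watch these values is to read the Prometheus-style text that a DCGM exporter publishes. The sketch below is a minimal parser plus a threshold check. The metric names follow DCGM field identifiers, but the sample text, the label set, and the limits in `LIMITS` are assumptions for illustration, not recommended values.

```python
import re

# Matches lines such as: NAME{label="value",...} 12.5
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical limits: flag hot GPUs and any double-bit ECC error.
LIMITS = {
    "DCGM_FI_DEV_GPU_TEMP": 85.0,
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL": 0.0,
}


def parse_metrics(text):
    """Return (metric name, labels dict, value) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


def find_unhealthy(samples, limits=LIMITS):
    """Return (gpu, metric, value) for every sample above its limit."""
    return [(labels.get("gpu", "?"), name, value)
            for name, labels, value in samples
            if name in limits and value > limits[name]]


# Made-up exporter output for two GPUs on one node.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a"} 91
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0",Hostname="node-a"} 0
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="1",Hostname="node-a"} 2
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node-a"} 402.5
"""
```

With this sample, `find_unhealthy(parse_metrics(SAMPLE))` flags GPU 1 twice: once for temperature and once for ECC errors.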

1.3 Base Command Manager (BCM)

  • Manages an entire cluster of GPU nodes in an AI data center
  • Job scheduling and monitoring
  • Multi-team, multi-user, and multi-environment management
  • Ensures optimal resource allocation at scale

2. Infrastructure Monitoring Tools

2.1 Prometheus

  • Time-series metrics collection
  • Scrapes metrics from nodes
  • Stores and queries metrics
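
Querying stored metrics goes through the Prometheus HTTP API. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and flattens the JSON response. The server address and the canned response are assumptions for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def parse_instant_vector(payload):
    """Turn an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got %s" % data["resultType"])
    # Each sample carries [timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(base_url, promql, timeout=5.0):
    """Run a query against a live Prometheus server."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


# Canned response in the documented shape, so the parser runs offline.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "node-a:9100"}, "value": [1700000000.0, "0.42"]},
            {"metric": {"instance": "node-b:9100"}, "value": [1700000000.0, "0.87"]},
        ],
    },
}
```

Against a real server the call would look like `instant_query("http://prometheus.example:9090", "up")`, where the hostname is a placeholder.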

2.2 Grafana

  • Visualization dashboards
  • Real-time monitoring
  • Alerting integration

2.3 Node Exporter

  • Collects system-level metrics
  • CPU, memory, disk, network

3. Cluster Orchestration Monitoring

3.1 Kubernetes

Monitor:

  • Pod status
  • Node health
  • Resource usage
  • Scheduling issues
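
A minimal sketch of a pod status check: it reads the JSON that `kubectl get pods -A -o json` prints and lists pods that are neither running nor finished, or that restart often. The restart limit and the sample pods are assumptions for this example.

```python
import json

HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list, max_restarts=3):
    """Return (namespace/name, reason) for pods that need attention."""
    problems = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        name = "%s/%s" % (meta.get("namespace", "default"), meta.get("name", "?"))
        phase = status.get("phase", "Unknown")
        restarts = sum(c.get("restartCount", 0)
                       for c in status.get("containerStatuses", []))
        if phase not in HEALTHY_PHASES:
            problems.append((name, "phase=%s" % phase))
        elif restarts > max_restarts:
            problems.append((name, "restarts=%d" % restarts))
    return problems


# Made-up, heavily trimmed pod list in the shape kubectl emits.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "ml", "name": "train-0"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "trainer", "restartCount": 0}]}},
  {"metadata": {"namespace": "ml", "name": "train-1"},
   "status": {"phase": "Pending"}},
  {"metadata": {"namespace": "ml", "name": "loader-0"},
   "status": {"phase": "Running",
              "containerStatuses": [{"name": "loader", "restartCount": 7}]}}
]}
""")
```

A pod stuck in `Pending` usually points at a scheduling issue, for example no node with a free GPU, so it is worth reporting separately from crash loops.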

3.2 Slurm (Simple Linux Utility for Resource Management) (HPC environments)

Slurm is an open-source cluster resource management and job scheduling system that strives to be simple, scalable, portable, fault-tolerant, and interconnect-agnostic.

Monitor:

  • Job queue
  • Resource allocation
  • Failed jobs
  • Node states
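
Queue depth can be read from `squeue`. The sketch below assumes the command `squeue -h -o "%i|%T|%u"` (no header; job ID, extended state, user) and counts jobs per state. The sample lines are made up.

```python
import subprocess
from collections import Counter

# -h drops the header; %i = job ID, %T = state (long form), %u = user.
SQUEUE_CMD = ["squeue", "-h", "-o", "%i|%T|%u"]


def parse_squeue(text):
    """Parse pipe-separated squeue output into one dict per job."""
    jobs = []
    for line in text.strip().splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # ignore lines that do not match the format
        job_id, state, user = parts
        jobs.append({"id": job_id, "state": state, "user": user})
    return jobs


def summarize(jobs):
    """Count jobs per state, e.g. how many are still PENDING."""
    return Counter(job["state"] for job in jobs)


def live_summary():
    """Run squeue on a Slurm login node and summarize the queue."""
    result = subprocess.run(SQUEUE_CMD, capture_output=True, text=True, check=True)
    return summarize(parse_squeue(result.stdout))


# Made-up queue snapshot.
SAMPLE = """\
1001|RUNNING|alice
1002|PENDING|bob
1003|PENDING|alice
1004|COMPLETING|carol
"""
```

Here `summarize(parse_squeue(SAMPLE))["PENDING"]` is 2, which is the queue depth a dashboard would track over time.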

Logging Systems

Logs provide:

  • Error tracing
  • Job debugging
  • Security auditing
  • System failure analysis

Centralized logging:

  • Aggregated logs
  • Searchable
  • Long-term retention

Alerting Strategy

Monitoring without alerting = useless.

Effective alerts:

  • Temperature threshold exceeded
  • GPU ECC errors
  • Node unreachable
  • Disk nearly full
  • Network congestion

Alerts should be:

  • Actionable
  • Prioritized
  • Not noisy
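
These three properties can be sketched as a tiny rule evaluator: each rule names a metric and threshold (actionable), carries a severity (prioritized), and fires only after several consecutive breaches (not noisy). The rule names, thresholds, and sample history are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str
    threshold: float
    severity: int      # 1 = most urgent
    min_breaches: int  # consecutive samples above threshold before firing (>= 1)


def evaluate(rules, history):
    """Return firing rules, most urgent first.

    `history` maps a metric name to its samples, oldest first.
    """
    firing = []
    for rule in rules:
        recent = history.get(rule.metric, [])[-rule.min_breaches:]
        sustained = len(recent) == rule.min_breaches
        if sustained and all(value > rule.threshold for value in recent):
            firing.append(rule)
    return sorted(firing, key=lambda rule: rule.severity)


# Hypothetical rules and readings.
RULES = [
    Rule("DiskNearlyFull", "disk_used_pct", 90.0, severity=2, min_breaches=1),
    Rule("GpuTooHot", "gpu_temp_c", 85.0, severity=1, min_breaches=3),
    Rule("NetworkCongested", "nic_util_pct", 95.0, severity=3, min_breaches=3),
]
HISTORY = {
    "gpu_temp_c": [80.0, 88.0, 90.0, 91.0],    # sustained breach: fires
    "disk_used_pct": [95.0],                   # single sample is enough: fires
    "nic_util_pct": [99.0, 60.0, 99.0],        # brief spikes only: stays quiet
}
```

With these inputs `evaluate(RULES, HISTORY)` returns `GpuTooHot` first and `DiskNearlyFull` second; the network rule stays silent because the breach was not sustained.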

Power & Cooling Monitoring

AI clusters consume massive power.

Monitor:

  • Rack power draw
  • PSU health
  • Cooling system efficiency
  • Data center temperature
  • Airflow

Failure to monitor → thermal shutdown.

Cooling Options

1. Air Cooling

  • Practical up to about 30 kW per rack
  • Less efficient at high densities
  • Lower infrastructure cost

2. Liquid Cooling

  • Better for high density racks (30–80 kW+)
  • More efficient heat removal
  • Expensive infrastructure
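
The choice comes down to simple arithmetic on rack power. The sketch below uses the roughly 30 kW per rack air-cooling ceiling from the list above; the node counts and per-node power figures are hypothetical.

```python
# Air-cooling ceiling per rack, taken from the ~30 kW figure in the text.
AIR_COOLING_LIMIT_KW = 30.0


def rack_power_kw(nodes_per_rack, node_power_kw, overhead_kw=0.0):
    """Estimate rack power: nodes times per-node draw, plus switches and so on."""
    return nodes_per_rack * node_power_kw + overhead_kw


def suggest_cooling(rack_kw):
    """Suggest 'air' up to the ceiling and 'liquid' above it."""
    return "air" if rack_kw <= AIR_COOLING_LIMIT_KW else "liquid"


# Hypothetical racks: four 10 kW GPU nodes vs. two nodes plus 1.5 kW of overhead.
dense_rack = rack_power_kw(4, 10.0)                     # 40.0 kW
light_rack = rack_power_kw(2, 10.0, overhead_kw=1.5)    # 21.5 kW
```

Under these assumptions the dense rack lands in liquid-cooling territory and the light rack stays within air cooling.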

High Availability (HA)

AI infrastructure should support:

  • Redundant power supplies
  • Redundant networking paths
  • Failover nodes
  • Backup storage

Single point of failure = unacceptable.

Failure Scenarios to Understand

Common failures:

  • GPU overheating
  • Node crash
  • Network congestion
  • Storage saturation
  • Job scheduler deadlock

Monitoring enables:

  • Rapid detection
  • Root cause analysis
  • Faster recovery

Security Monitoring

Includes:

  • Unauthorized access attempts
  • Configuration changes
  • Network anomalies
  • DPU isolation policies
  • Role-based access control

Capacity Planning

Operations teams must:

  • Track GPU utilization trends
  • Forecast storage growth
  • Monitor network saturation
  • Plan rack power expansion

Goal: Avoid capacity shortages.
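
Forecasting can start with a straight-line fit over recent usage. The sketch below fits a least-squares line to daily storage samples and estimates the days left until capacity is reached. The sample numbers are made up, and real growth is rarely this linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all x values are identical")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_tb, capacity_tb):
    """Days after the last sample until the fitted line hits capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, used_tb)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - days[-1])


# Made-up samples: usage grows 10 TB per day toward a 200 TB filesystem.
DAYS = [0, 1, 2, 3]
USED_TB = [100.0, 110.0, 120.0, 130.0]
```

For this data `days_until_full(DAYS, USED_TB, 200.0)` gives 7 days, which is the lead time available for ordering or freeing capacity.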


Lifecycle Management

Operations include:

  • Firmware updates
  • Driver updates
  • CUDA updates
  • Security patches
  • Hardware replacement

Change management must:

  • Minimize downtime
  • Be documented
  • Be tested

Observability vs Monitoring

Monitoring:

  • What is happening?

Observability:

  • Why is it happening?

Observability includes:

  • Metrics
  • Logs
  • Traces

GPU Utilization Optimization

Low GPU utilization may indicate:

  • Storage bottleneck
  • Network congestion
  • Poor job scheduling
  • Insufficient batch size
  • CPU bottleneck

Operations teams investigate before scaling hardware.
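
That investigation can be sketched as a first-pass triage over a metrics snapshot. Every field name and threshold below is an assumption chosen for illustration; real cutoffs depend on the workload.

```python
def likely_bottlenecks(sample, low_gpu_util=50.0):
    """Suggest where to look first when GPUs are underutilized.

    `sample` is a dict with the hypothetical keys used below.
    Returns an empty list when GPU utilization looks healthy.
    """
    if sample["gpu_util_pct"] >= low_gpu_util:
        return []
    checks = [
        ("storage bottleneck", sample["io_wait_pct"] > 20.0),
        ("network congestion", sample["nic_util_pct"] > 90.0),
        ("CPU bottleneck", sample["cpu_util_pct"] > 90.0),
        ("poor job scheduling",
         sample["pending_jobs"] > 0 and sample["gpu_util_pct"] < 5.0),
    ]
    suspects = [name for name, hit in checks if hit]
    # Nothing obvious: batch size is the remaining cause from the list above.
    return suspects or ["inconclusive: check batch size"]


# Made-up snapshot: GPUs mostly idle while the CPUs are saturated.
SNAPSHOT = {
    "gpu_util_pct": 30.0,
    "io_wait_pct": 4.0,
    "nic_util_pct": 40.0,
    "cpu_util_pct": 97.0,
    "pending_jobs": 0,
}
```

For `SNAPSHOT` the function points at a CPU bottleneck, for example data preprocessing that cannot keep the GPUs fed, which is cheaper to fix than adding hardware.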


Exam Scenarios to Recognize

If a question mentions:

  • GPU temperature spikes → Thermal monitoring
  • ECC memory errors → DCGM
  • Dashboard visualization → Grafana
  • Metric scraping → Prometheus
  • HPC job queue management → Slurm
  • Container orchestration → Kubernetes
  • Rack power issue → Data center monitoring
  • Underutilized GPUs → Operational inefficiency

Quick Memory Anchors

  • DCGM = GPU health monitoring
  • Prometheus = Metrics collection
  • Grafana = Visualization
  • Slurm = HPC job scheduler
  • Kubernetes = Container orchestration
  • Monitoring prevents GPU idle time
  • Alerting must be actionable