Deploying Agents at Scale
Learn how to deploy AI agents reliably in production using containerization, orchestration, observability, evaluation pipelines, guardrails, retries, scaling strategies, and resilient architectures. Explore best practices for running agentic systems across cloud environments while maintaining performance, reliability, security, and cost efficiency.
NVIDIA Inference Stack
flowchart TD
Training["Training <br/> PyTorch / NeMo Model"]--> Model["ONNX"]
User --> NIM["NIM <br/> Production APIs"]
NIM --> Triton["Triton Server <br/>Inference"]
Triton--> TensorRT-LLM["TensorRT-LLM <br/> Runtime"]
Model-->TensorRT-LLM
TensorRT-LLM--> GPU["GPU Rack"]
Components
| Component | Purpose |
|---|---|
| CUDA | GPU execution platform |
| TensorRT-LLM | LLM optimization |
| Triton | Model serving |
| NIM | Packaged inference microservice |
| Kubernetes | Deployment & scaling |
Build using Docker
Packages the agent, its model config, tool dependencies, and runtime into a single reproducible image.
Pin all versions (model weights, libraries, Python) for deterministic builds.
Pipeline
flowchart TD
Commit["Code Commit"]
Eval["Automated Eval"]
Build["Container Build"]
Deploy["Shadow Deployment"]
rollback["Promote/Rollback"]
Commit-->Eval-->Build-->Deploy-->rollback
Automated Eval
Run a benchmark suite against a golden dataset to catch regressions in agent behaviour before they reach users.
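A minimal sketch of such an eval gate is shown here. The `GoldenCase` type, the substring check, the `toy_agent` stand-in, and the 2-point allowed drop are all assumptions made for illustration; a real suite would call the actual agent and likely use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str  # substring the answer must contain (assumed scoring rule)


def pass_rate(agent: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose expected text appears in the agent's answer."""
    if not cases:
        raise ValueError("golden dataset is empty")
    passed = sum(
        1 for case in cases if case.expected.lower() in agent(case.prompt).lower()
    )
    return passed / len(cases)


def gate(candidate_rate: float, baseline_rate: float, max_drop: float = 0.02) -> bool:
    """Allow the build only if it is at most max_drop below the baseline."""
    return candidate_rate >= baseline_rate - max_drop


# Hypothetical stand-in for a real agent call.
def toy_agent(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


GOLDEN = [
    GoldenCase("What is the capital of France?", "paris"),
    GoldenCase("What is the capital of Japan?", "tokyo"),
]

rate = pass_rate(toy_agent, GOLDEN)
print(f"pass rate={rate:.2f}, promote={gate(rate, baseline_rate=0.9)}")
```

In a CI job, a `False` result from `gate` would fail the step so the container build never starts.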
Shadow Deployment
Benchmark the new build against the existing production deployment using real user traffic, without affecting any user.
Goal
- Observe performance under real-world traffic
- Validate that the model performs well
- Benchmark against the current live model
More on this in the next post.
Validate with real traffic.
flowchart LR
User --> Production
User -. Mirror .-> Shadow
Production --> Response
Shadow -. Discard .-> Trash
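The mirroring flow in the diagram can be sketched in application code as follows. This is an illustration only: `ShadowRouter`, its handlers, and the comparison record are hypothetical names, and in practice mirroring is often done at the proxy or service-mesh layer rather than in the application.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Handler = Callable[[str], str]


class ShadowRouter:
    """Serve from production; mirror each request to shadow and keep only a comparison."""

    def __init__(self, production: Handler, shadow: Handler) -> None:
        self._production = production
        self._shadow = shadow
        self._pool = ThreadPoolExecutor(max_workers=4)
        self.comparisons: list[dict] = []

    def _mirror(self, request: str, prod_response: str) -> None:
        try:
            shadow_response = self._shadow(request)
            record = {"request": request, "match": shadow_response == prod_response}
        except Exception as exc:  # shadow failures must never reach the user
            record = {"request": request, "error": repr(exc)}
        self.comparisons.append(record)  # the shadow response itself is discarded

    def handle(self, request: str) -> str:
        response = self._production(request)
        self._pool.submit(self._mirror, request, response)  # fire and forget
        return response

    def drain(self) -> None:
        """Wait for in-flight mirrored calls (useful in tests and on shutdown)."""
        self._pool.shutdown(wait=True)


def prod(request: str) -> str:
    return request.upper()


def shadow(request: str) -> str:
    if request == "boom":
        raise RuntimeError("bug in new build")
    return request.upper()


router = ShadowRouter(prod, shadow)
answers = [router.handle("hello"), router.handle("boom")]
router.drain()
print(answers, router.comparisons)
```

The user only ever receives the production answer; the shadow build's error on `"boom"` shows up in `comparisons` and nowhere else.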
Promote/Rollback
Roll out to production using a blue-green or canary deployment.
With canary, expose the new build to real users gradually.
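One way to picture gradual exposure is a stable hash that places each user in a bucket, with the canary percentage raised step by step. The sketch below assumes routing by a `user_id` string; the function names are hypothetical, and real rollouts usually configure this split in the ingress or service mesh.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so they always see the same version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def pick_version(user_id: str, canary_percent: int) -> str:
    """Route roughly canary_percent of users to the new build, the rest to stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 5, 25, 100):
    share = sum(pick_version(u, percent) == "canary" for u in users) / len(users)
    print(f"{percent:>3}% target -> {share:.1%} of users on canary")
```

Because the bucket depends only on the user ID, a user who lands on the canary at 5% stays on it at 25%, and rolling back is just setting the percentage to 0.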
Kubernetes
Orchestrates multiple containers at scale.
Handles scheduling, health checks, rolling updates, and auto-scaling.
Each agent type runs as a Deployment with its own replica count and resource limits.
A production-grade pattern used by many AI agent platforms:
- Deploy stateless agent workers as a Kubernetes Deployment.
- Use Redis or Kafka as a task queue.
- Expose queue depth as an external metric.
- Configure an HPA or a KEDA ScaledObject.
- Scale replicas based on queue depth.
- Store session state and memory externally.
flowchart LR
Producers --> Queue
Queue --> Agent1
Queue --> Agent2
Queue --> Agent3
Queue -. task_queue_depth .-> HPA
HPA -. Scale Up/Down .-> Deployment
Deployment --> Agent1
Deployment --> Agent2
Deployment --> Agent3
Horizontal Pod Autoscaler (HPA)
Scales replica count up/down based on CPU, memory, or custom metrics (e.g. queue depth).
Handles traffic spikes without manual intervention.
Example application/deployment.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler   # horizontally scales pods
metadata:
  name: agent-worker-hpa        # name of this HPA object
spec:
  scaleTargetRef:               # workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: agent-worker
  minReplicas: 2                # never scale below 2 replicas
  maxReplicas: 20               # never scale above 20 replicas
  metrics:
    - type: External
      external:
        metric:
          name: task_queue_depth  # external metric, e.g. queue length exported from Redis
        target:
          type: AverageValue
          averageValue: "10"      # aim for about 10 queued tasks per replica
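To see how the `AverageValue` target above turns into a replica count, the following sketch approximates the HPA calculation for this manifest. It is a simplification: the 10% tolerance matches the controller's default, but stabilization windows and scaling policies are ignored, and the `desired_replicas` helper is hypothetical.

```python
import math


def desired_replicas(
    queue_depth: int,
    current_replicas: int,
    target_per_replica: float = 10.0,
    min_replicas: int = 2,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA decision for an External metric with an AverageValue target."""
    current_average = queue_depth / max(current_replicas, 1)
    ratio = current_average / target_per_replica
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size to avoid flapping.
        wanted = current_replicas
    else:
        wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 25, 42, 95, 500):
    print(f"queue depth {depth:>3} -> {desired_replicas(depth, current_replicas=4)} replicas")
```

With 4 replicas running, a depth of 42 is within tolerance and changes nothing, a depth of 95 asks for 10 replicas, and a depth of 500 is capped at `maxReplicas`.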
Deploying agent-worker-hpa
Deployment
kubectl apply -f application/deployment.yaml
Validate
# Verify deployment
kubectl get deployment agent-worker
# Verify HPA
kubectl get hpa
# Debug HPA
kubectl describe hpa agent-worker-hpa
Hierarchical orchestration
One orchestrator agent fans out sub-tasks to specialist workers.
The orchestrator holds the plan; workers are stateless executors.
At scale, the orchestrator itself can be replicated with task queues (Redis, Kafka) providing coordination.
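A toy version of the fan-out is sketched below, using a thread pool in place of a real task queue. The worker names and the `orchestrate` function are hypothetical, and the workers are plain functions standing in for specialist agents.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical specialist workers: pure functions with no state between calls.
WORKERS: dict[str, Callable[[str], str]] = {
    "search": lambda task: f"search results for {task!r}",
    "summarise": lambda task: f"summary of {task!r}",
    "code": lambda task: f"patch for {task!r}",
}


def orchestrate(plan: list[tuple[str, str]]) -> list[str]:
    """Run each (worker_name, sub_task) step in parallel; return results in plan order."""
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        futures = [pool.submit(WORKERS[name], task) for name, task in plan]
        return [future.result() for future in futures]


# The orchestrator owns the plan; workers only see their own sub-task.
plan = [("search", "HPA docs"), ("summarise", "HPA docs"), ("code", "fix manifest")]
for line in orchestrate(plan):
    print(line)
```

Swapping the thread pool for Redis or Kafka keeps the same shape: the orchestrator publishes sub-tasks and collects results, and any worker replica can pick up any sub-task.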
Horizontal agent scaling
Deploy multiple identical worker agent replicas behind a load balancer.
Each replica handles independent tasks.
Stateless design is key so any replica can pick up any task.
flowchart TD
Producers["Producers / API Gateway"]
LoadBalancer["Load Balancer"]
AgentA["Agent Replica A <br/>Stateless"]
AgentB["Agent Replica B <br/>Stateless"]
AgentC["Agent Replica C <br/>Stateless"]
Queue["Task Queue <br/>Redis / Kafka"]
Session["Session State <br/>Redis / DynamoDB"]
LTM["Long-Term Memory <br/>Vector DB / RAG"]
ToolCache["Tool Cache<br/>Redis / Memcached"]
Producers --> LoadBalancer
LoadBalancer --> |monitor| AgentA
LoadBalancer --> |monitor| AgentB
LoadBalancer --> |monitor| AgentC
Producers --> Queue
Queue --> |poll| AgentA
Queue --> |poll| AgentB
Queue --> |poll| AgentC
AgentA <--> Session
AgentB <--> Session
AgentC <--> Session
AgentA <--> LTM
AgentB <--> LTM
AgentC <--> LTM
AgentA <--> ToolCache
AgentB <--> ToolCache
AgentC <--> ToolCache
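The reason stateless replicas are interchangeable can be shown with a small sketch. Here `SessionStore` is an in-memory stand-in for an external store such as Redis or DynamoDB, and both class names are hypothetical.

```python
class SessionStore:
    """In-memory stand-in for an external store such as Redis or DynamoDB."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def load(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))

    def save(self, session_id: str, history: list[str]) -> None:
        self._data[session_id] = list(history)


class AgentReplica:
    """Keeps no per-session state; everything is loaded from and saved to the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self._store = store

    def handle(self, session_id: str, message: str) -> str:
        history = self._store.load(session_id)
        history.append(message)
        self._store.save(session_id, history)
        return f"{self.name} saw {len(history)} message(s) in {session_id}"


store = SessionStore()
replica_a = AgentReplica("A", store)
replica_b = AgentReplica("B", store)

print(replica_a.handle("s1", "hello"))
print(replica_b.handle("s1", "what did I just say?"))
```

Replica B continues the conversation that replica A started, which is what lets the load balancer or queue hand any task to any replica.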
KEDA (Kubernetes Event-Driven Autoscaling)
Extends Kubernetes autoscaling beyond CPU and memory.
Common triggers:
- Redis Queue Depth
- Kafka Lag
- RabbitMQ Queue Length
- AWS SQS Messages
KEDA automatically creates and manages an HPA.
flowchart LR
Queue -. Queue Depth .-> KEDA
KEDA --> HPA
HPA --> Deployment
Task queue decoupling
Decouple task submission from execution via a queue.
Producers push tasks; agent workers pull and process.
Enables backpressure, retry, and independent scaling of producers vs consumers.
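As a rough single-process illustration, the sketch below uses Python's standard `queue.Queue` in place of Redis or Kafka. The bounded queue provides backpressure, and workers retry a failing task a fixed number of times. The `process` handler, the payloads, and `MAX_ATTEMPTS` are made up for the demo.

```python
import queue
import threading

MAX_ATTEMPTS = 3
tasks = queue.Queue(maxsize=5)  # bounded: a full queue pushes back on producers
results: list[str] = []
failed: list[str] = []


def process(payload: str, attempt: int) -> str:
    """Hypothetical task handler; 'flaky' fails on its first attempt only."""
    if payload == "flaky" and attempt == 1:
        raise RuntimeError("transient tool error")
    if payload == "poison":
        raise ValueError("cannot be processed")
    return f"done:{payload}"


def worker() -> None:
    while True:
        payload = tasks.get()
        try:
            if payload is None:  # sentinel: shut this worker down
                return
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    results.append(process(payload, attempt))
                    break
                except Exception:
                    if attempt == MAX_ATTEMPTS:
                        failed.append(payload)  # a real system might dead-letter it
        finally:
            tasks.task_done()


threads = [threading.Thread(target=worker) for _ in range(2)]
for thread in threads:
    thread.start()

for payload in ["a", "flaky", "poison", "b"]:
    tasks.put(payload, timeout=5)  # blocks while the queue is full (backpressure)
for _ in threads:
    tasks.put(None)

tasks.join()
for thread in threads:
    thread.join()
print(sorted(results), failed)
```

Producers and workers never call each other directly, so either side can be scaled or restarted without the other noticing beyond a change in queue depth.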
1. Route by queue depth
Reads the actual task queue depth from each replica (or from Redis) and routes to the one with the shortest queue.
The most accurate signal for agent workloads, because it accounts for queued but not-yet-started tasks.
2. Round Robin
Cycles through replicas in order regardless of their current load.
Simple and fair for uniform workloads, but can create hotspots when some agent tasks take much longer than others.
3. Least Connections
Always picks the replica with the fewest active tasks.
Ideal for agents because task duration varies wildly: a 20-step reasoning chain holds a connection far longer than a 2-step lookup.
4. Weighted
Replicas are assigned weights reflecting their capacity.
- A GPU-backed replica might get weight 3 and a CPU replica weight 1, so the GPU replica receives 3× the traffic.
Good when replicas have different specs.
5. Random
Picks a replica at random.
Statistically converges to even distribution at scale but can cluster by chance on small request counts.
Low overhead: no state to track.
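The five strategies above can be compared side by side in a few lines. This is a sketch under assumptions: the `Replica` fields and the sample pool are invented, and a real load balancer would read these counters from live metrics rather than static values.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    weight: int = 1        # relative capacity, used by the weighted strategy
    active_tasks: int = 0  # in-flight tasks, used by least connections
    queued_tasks: int = 0  # waiting tasks, used by queue-depth routing


def by_queue_depth(replicas: list[Replica]) -> Replica:
    return min(replicas, key=lambda r: r.queued_tasks)


def least_connections(replicas: list[Replica]) -> Replica:
    return min(replicas, key=lambda r: r.active_tasks)


def make_round_robin(replicas: list[Replica]):
    cycle = itertools.cycle(replicas)
    return lambda: next(cycle)


def weighted(replicas: list[Replica], rng: random.Random) -> Replica:
    return rng.choices(replicas, weights=[r.weight for r in replicas], k=1)[0]


def pick_random(replicas: list[Replica], rng: random.Random) -> Replica:
    return rng.choice(replicas)


pool = [
    Replica("gpu-1", weight=3, active_tasks=4, queued_tasks=1),
    Replica("cpu-1", weight=1, active_tasks=1, queued_tasks=6),
    Replica("cpu-2", weight=1, active_tasks=2, queued_tasks=3),
]
rng = random.Random(0)  # seeded only to make the demo repeatable

next_rr = make_round_robin(pool)
print("queue depth:      ", by_queue_depth(pool).name)
print("least connections:", least_connections(pool).name)
print("round robin:      ", [next_rr().name for _ in range(4)])
print("random:           ", pick_random(pool, rng).name)

counts = {r.name: 0 for r in pool}
for _ in range(10_000):
    counts[weighted(pool, rng).name] += 1
print("weighted:         ", counts)
```

Note how queue depth and least connections disagree on the same pool: `gpu-1` has the shortest queue but the most in-flight tasks, so the choice of signal changes where the next task lands.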
ConfigMap / Secret
Externalise model endpoint URLs, API keys, and policy configs from the image. This enables config changes without a rebuild.
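On the application side, this typically means reading settings from environment variables that the ConfigMap and Secret inject. The sketch below assumes three variable names (`MODEL_ENDPOINT`, `MODEL_API_KEY`, `MAX_TOOL_CALLS`) that are purely illustrative, and passes a plain dict instead of the real environment so it runs anywhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class AgentConfig:
    model_endpoint: str
    api_key: str
    max_tool_calls: int


def load_config(env: Mapping[str, str] = os.environ) -> AgentConfig:
    """Read settings that a ConfigMap/Secret would inject as environment variables."""
    missing = [key for key in ("MODEL_ENDPOINT", "MODEL_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return AgentConfig(
        model_endpoint=env["MODEL_ENDPOINT"],
        api_key=env["MODEL_API_KEY"],
        max_tool_calls=int(env.get("MAX_TOOL_CALLS", "8")),
    )


# Simulated environment; in a pod these values would come from envFrom / valueFrom.
config = load_config({
    "MODEL_ENDPOINT": "http://model.internal.example:8000/v1",
    "MODEL_API_KEY": "dummy-key-for-demo",
})
print(config.model_endpoint, config.max_tool_calls)
```

Failing fast on a missing key at startup surfaces a bad ConfigMap during rollout rather than on the first user request.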
GPU node pool
Schedule inference pods on GPU nodes using nodeSelector or taints/tolerations. The NVIDIA device plugin exposes GPUs as schedulable resources.
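For illustration, the snippet below builds such a Pod manifest as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The `pool: gpu` node label, the taint key, the pod name, and the image are assumptions; only the `nvidia.com/gpu` resource name comes from the NVIDIA device plugin.

```python
import json


def gpu_pod_spec(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests GPUs and targets a GPU node pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed label; use whatever label your GPU nodes actually carry.
            "nodeSelector": {"pool": "gpu"},
            # Assumed taint on the GPU pool with key nvidia.com/gpu and effect NoSchedule.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "inference",
                    "image": image,
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


manifest = gpu_pod_spec("agent-inference", "registry.example/agent-inference:1.0.0")
print(json.dumps(manifest, indent=2))
```

The node selector steers the pod toward GPU nodes, the toleration lets it land on nodes tainted to keep other workloads off, and the resource limit is what actually reserves a GPU.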
