Hitesh Sahu


Error Analysis for Agentic AI

Learn how to systematically diagnose, measure, and improve failures in Agentic AI systems using error analysis. Discover how traces, component-level evaluations, root cause analysis, and observability help identify bottlenecks and drive continuous improvement in AI agent performance.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026


Error classification: the first decision

Everything downstream depends on classifying the error correctly, because the wrong classification wastes retries on permanent failures and gives up too early on transient ones.

1. Transient errors ⚠️

Temporary conditions the system will recover from without any change:

  • Network timeouts
  • 503 Service Unavailable
  • Temporary resource exhaustion.

Handling

These should be retried.

flowchart TD

    Error["Transient Error ⚠️"] --> Backoff["Exponential Backoff with Jitter"]
    Backoff --> Final["Success ✅ or Fail ❌"]

Retries with exponential backoff and jitter

The retry pattern protects against occasional failures.

The formula is:

$$wait = \min(cap,\ base \times 2^{n}) + random(0,\ base)$$

Where n is the retry attempt number and, in a typical configuration:

  • base = 1s
  • cap = 30s
  • max_retries = 3

The min(cap, ...) term keeps the wait from growing without bound.

The + random(...) term is jitter, which is critically important in systems where multiple agents retry simultaneously.

Without jitter, all agents back off to the same interval and then hammer the service in a thundering herd at exactly the same moment.

Jitter spreads them out.

Caution

Never retry without a budget. max_retries is non-negotiable.

An unbounded retry loop will starve other work and can take down a service more effectively than the original failure did.
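The formula and the retry budget above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the sketch assumes the attempt counter n starts at 0.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Seconds to wait before the next retry; `attempt` is 0-based (an assumption)."""
    # min(cap, base * 2^n) bounds the exponential part; rng() * base is the jitter.
    return min(cap, base * 2 ** attempt) + rng() * base


def call_with_retries(fn, max_retries=3, base=1.0, cap=30.0, sleep=time.sleep):
    """Call fn(); on TransientError, back off and retry at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # budget exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap))
```

With base = 1s and cap = 30s, the exponential part grows as 1s, 2s, 4s, and so on, and each wait gets up to one extra second of jitter. The `sleep` and `rng` parameters are injectable only so the behaviour can be tested without real waiting.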

1.1 Rate-limit errors (429)

API Limit Exceeded

Special case: they are transient, but the retry timing is dictated by the server via a Retry-After header, not by your backoff formula.

Handling

Always honour the Retry-After header when retrying the API call.
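As a rough sketch of what honouring the header involves: Retry-After may carry either a number of seconds or an HTTP date, so both forms need handling. The helper name `retry_after_seconds` and its `default` fallback are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=1.0):
    """Convert a Retry-After header value into a wait time in seconds."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

The returned value replaces the computed backoff for that attempt; the retry budget still applies.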


2. Permanent errors 🚨

These indicate a fundamental problem with the request itself:

  • 404 Not Found
  • 400 Bad Request
  • 401 Unauthorized
  • Schema validation failures

Retrying these is pointless and wastes budget.

Handling

Log, skip, and try a fallback or escalate.

flowchart TD

    Error["Permanent errors 🚨"] --> Log["Log Failure"] --> Final["fallback or escalate ❌"]
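The three classes described so far can be encoded as a small lookup. This sketch only covers the status codes named in this post; treating every unlisted code as permanent is a conservative assumption, and `ErrorClass` and `classify` are hypothetical names.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"        # retry with backoff and jitter
    RATE_LIMITED = "rate_limited"  # retry, but on the server's schedule
    PERMANENT = "permanent"        # log, then fall back or escalate


TRANSIENT_STATUSES = {503}  # extend with other codes your services use


def classify(status_code=None, exc=None):
    """Decide how to handle a failed call from its status code or exception."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT  # network timeouts and dropped connections
    if status_code == 429:
        return ErrorClass.RATE_LIMITED
    if status_code in TRANSIENT_STATUSES:
        return ErrorClass.TRANSIENT
    # Assumption: anything unrecognised is not retried (400, 401, 404, ...).
    return ErrorClass.PERMANENT
```

Defaulting to PERMANENT means an unknown error costs one failed call rather than a full retry budget.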


3. Circuit breaker ❗

The circuit breaker protects against sustained outages.

The circuit breaker sits in front of every downstream dependency call and tracks the failure rate in a rolling time window.

Why it exists

Without a circuit breaker, every incoming request triggers retries against a downed service, consuming threads, connections, and budget, and potentially causing cascading failure in upstream systems that are waiting for responses.

stateDiagram-v2
    [*] --> CLOSED

    CLOSED --> OPEN : Failure rate exceeds threshold
    OPEN --> HALF_OPEN : Recovery timeout expires

    HALF_OPEN --> CLOSED : Probe request succeeds
    HALF_OPEN --> OPEN : Probe request fails

    state CLOSED {
        [*] --> Healthy
        Healthy : Requests pass through
    }

    state OPEN {
        [*] --> FastFail
        FastFail : Reject requests immediately
    }

    state HALF_OPEN {
        [*] --> Probe
        Probe : Allow limited probe requests
    }

When failures exceed a threshold, it flips to OPEN and fast-fails all subsequent calls immediately, with no actual network request made.

After a configured timeout, it enters HALF-OPEN and allows one probe request through.

  • If the probe succeeds, it resets to CLOSED.
  • If it fails, it returns to OPEN and resets the timer.

flowchart TD

    Request[Incoming Request]

    Request --> CB{Circuit State?}

    CB -->|CLOSED| Service[Call Downstream Service]

    Service -->|Success| Success[Return Response]

    Service -->|Failure| FailureCounter[Increment Failure Counter]

    FailureCounter --> Threshold{Threshold Reached?}

    Threshold -->|No| Error[Return Error]
    Threshold -->|Yes| Open[Open Circuit]

    CB -->|OPEN| FastFail[Fast Fail Immediately]

    Open --> Timer[Start Recovery Timer]

    Timer --> HalfOpen[HALF-OPEN]

    HalfOpen --> Probe[Allow Limited Probe Request]

    Probe --> ProbeResult{Probe Success?}

    ProbeResult -->|Yes| CloseCircuit[Close Circuit]
    ProbeResult -->|No| ReopenCircuit[Reopen Circuit]

    CloseCircuit --> Success
    ReopenCircuit --> Timer

Handling

You need one circuit breaker per downstream dependency, never a shared instance.

If service A and service B both fail, they should trip their own breakers independently.
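The state machine above can be sketched as a small class. This is a simplified, single-threaded illustration: it counts failures inside a rolling window rather than computing a true failure rate, and the names `CircuitBreaker` and `CircuitOpenError` and all default values are assumptions.

```python
import time
from collections import deque


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_seconds=60.0,
                 recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = None    # None means the breaker is CLOSED
        self._probing = False     # True while a HALF_OPEN probe is in flight

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "OPEN" or (state == "HALF_OPEN" and self._probing):
            raise CircuitOpenError("circuit open; failing fast")
        self._probing = state == "HALF_OPEN"
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure(state)
            raise
        finally:
            self._probing = False
        if state == "HALF_OPEN":  # probe succeeded: reset to CLOSED
            self._failures.clear()
            self._opened_at = None
        return result

    def _on_failure(self, state):
        now = self._clock()
        if state == "HALF_OPEN":
            self._opened_at = now  # probe failed: reopen and restart the timer
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()  # forget failures outside the window
        if len(self._failures) >= self.failure_threshold:
            self._opened_at = now


# One breaker per downstream dependency, never a shared instance.
breakers = {"search_api": CircuitBreaker(), "llm_api": CircuitBreaker()}
```

Keeping the breakers in a dictionary keyed by dependency name (the keys here are placeholders) means an outage in one service never fast-fails calls to another.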


4. Fallback chains: degrading gracefully ⛔

A fallback chain encodes the hierarchy of what to try when each level fails.

The key design principles:

The fallback must be genuinely useful.

Design each tier to return something a user can act on when an error happens.

  • A fallback that returns an empty response or throws a different error is not a fallback, it is a delayed failure.

Transparency is mandatory.

If you returned cached data that may be stale, say so. If you fell back to a weaker model, say so.

Silent degradation is a trust violation: the user believes they got a primary-quality response when they did not.

The final fallback must always succeed.

The bottom of the chain is a safe default: a static template, a human escalation alert, or a "service temporarily unavailable" message.

This tier should have no external dependencies and never throw.
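A minimal sketch of these three principles, assuming each tier is a plain callable. `run_with_fallbacks` and the tier names in the usage example are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_fallbacks(tiers, safe_default):
    """Try each (name, fn) tier in order and return the first usable result."""
    for index, (name, fn) in enumerate(tiers):
        try:
            value = fn()
        except Exception as exc:
            logger.warning("tier %r failed: %s", name, exc)
            continue
        if value:  # an empty response is a delayed failure, not a fallback
            # `degraded` tells the caller the answer is not primary quality.
            return {"value": value, "tier": name, "degraded": index > 0}
    # Final tier: a static value with no external dependencies, so it cannot throw.
    return {"value": safe_default, "tier": "safe_default", "degraded": True}


def primary_model():
    raise RuntimeError("model unavailable")  # stand-in for a failing primary call


def cached_answer():
    return "cached summary (may be stale)"  # stand-in for a cache lookup


result = run_with_fallbacks(
    [("primary_model", primary_model), ("cache", cached_answer)],
    safe_default="Service temporarily unavailable",
)
```

Here `result["tier"]` is "cache" and `result["degraded"]` is true, which gives the caller what it needs to tell the user that the answer may be stale.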


How Top Teams Decide What to Fix Next

One of the biggest challenges in building agentic systems is not creating the first version.

It is improving it.

Almost every AI engineer has experienced this:

graph TD
    A[Build Workflow 🤖]
    --> B[Test Workflow 🔍]

    B --> C[Disappointing Results 👎🏻]

    C --> D[Now What?]

The problem is that agentic workflows contain many moving parts.

Without a systematic process, teams often spend weeks optimizing the wrong component.

This is where Error Analysis becomes one of the most valuable skills in Agentic AI engineering.


Why Guessing Fails 🤔

Many teams optimize based on intuition:

"This feels like a prompt issue."

"This feels like a retrieval issue."

"This feels like the model is weak."

Sometimes they are right.

Often they are not.

The danger is spending months improving a component that contributes very little to overall performance.

The Engineering Mindset 🎯

Strong AI teams think like performance engineers.

Instead of asking:

What can we improve?

They ask:

What should we improve?

Those are very different questions.

The first creates endless work.

The second creates measurable progress.

Traces: The X-Ray of Agentic Systems

To diagnose failures, we need visibility into intermediate outputs.

These intermediate outputs are called:

Trace 🧾

A trace contains every step executed by an agent.

One of the most valuable debugging techniques is trace inspection.

A trace records:

  • prompts
  • intermediate reasoning
  • tool calls
  • retrieval outputs
  • state transitions
  • memory updates

Example:

{
  "query": "Recent developments in black hole science",

  "search_terms": [
    "event horizon telescope discoveries",
    "black hole imaging research"
  ],

  "search_results": [
    "https://astrokidnews.com/...",
    "https://spacefunblog.com/..."
  ],

  "selected_sources": [
    "...",
    "..."
  ],

  "summary": "..."
}

Instead of examining only the final answer, we inspect every intermediate step.

Span vs Trace

Two terms frequently appear in observability systems.

Span

A single step.

Example:

  • Generate Search Terms
  • Fetch Web Results

Trace

The complete execution path.

Example: Search Workflow:

graph LR
    A[Search Terms]
    --> B[Search Results]
    --> C[Selected Sources]
    --> D[Summary]

A trace is simply a collection of spans.
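One way to picture the relationship is a pair of small data classes, where each recorded step becomes a span and the trace holds them in order. This is an illustrative sketch, not the data model of any particular observability tool; `Span`, `Trace`, and `record` are hypothetical names, and the lambdas stand in for real agent steps.

```python
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    name: str          # e.g. "generate_search_terms"
    inputs: Any
    outputs: Any
    duration_s: float


@dataclass
class Trace:
    query: str
    spans: list = field(default_factory=list)

    def record(self, name, fn, inputs):
        """Run one step, store its inputs and outputs as a span, return the output."""
        start = time.perf_counter()
        outputs = fn(inputs)
        self.spans.append(Span(name, inputs, outputs, time.perf_counter() - start))
        return outputs


trace = Trace(query="Recent developments in black hole science")
terms = trace.record(
    "generate_search_terms",
    lambda q: ["event horizon telescope discoveries"],  # placeholder step
    trace.query,
)
results = trace.record(
    "web_search",
    lambda t: ["https://example.org/article"],  # placeholder step
    terms,
)
```

Because every span keeps its own inputs and outputs, a poor final answer can be walked back step by step to the first span whose output a human expert would reject.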


The Error Analysis Flywheel

The best teams repeatedly execute:

graph TD
    A[Build 🤖]
    --> B[Observe 👀]

    B --> C[Trace Analysis 🔎]

    C --> D[Identify Bottleneck ⚠️]

    D --> E[Improve Component 🔨]

    E --> F[Measure Again 📋]

    F --> A

Each iteration makes the system incrementally better.

Error Analysis Workflow

A structured approach looks like this:

graph TD
    A[Bad Final Output 👎🏻]
    --> B[Inspect Trace 🔍]

    B --> C[Identify Weak Component ⚠️]

    C --> D[Count Frequency 📋]

    D --> E[Prioritize Fixes]

    E --> F[Improve System 📈]

This turns debugging into a data-driven process.


Practical Example

Suppose we ask a research agent:

Write a report on recent developments in black hole science.

The generated report misses several important discoveries.

At first glance:

Output Quality = Poor

But that tells us nothing about why.

The root cause could exist anywhere in the workflow.


graph TD
    A[User Query] --> B[Generate Search Terms] 
    B --> C[Web Search] 
    C --> D[Select Best Sources] 
    D --> E[Fetch Documents] 
    E --> F[Summarize] 
    F --> G[Generate Report]

Suppose our agent generated:

1. Search Terms

Black hole theories Einstein
Event Horizon Telescope radio

Question:

Would a human expert use these search terms?

If yes:

Search Term Generation = Good

Move on.

2. Search Results

Returned URLs:

AstroKidNews
SpaceFunBlog
SpaceBot2000

Question:

Would a human researcher use these sources?

Probably not.

A human would likely prefer:

  • Nature
  • Science
  • arXiv
  • NASA
  • ESA

Now we have a clue.

Search Results = Weak

The problem may not be the search terms.

The problem may be the search engine or ranking strategy.

Focus Only on Failures

One subtle but important recommendation:

Do not waste time analyzing successful runs.

Suppose:

| Run | Result |
|-----|--------|
| 1   | Good   |
| 2   | Good   |
| 3   | Poor   |
| 4   | Good   |
| 5   | Poor   |

Focus on:

Run 3
Run 5

These contain the information you need.

This is why it is called:

Error Analysis

The goal is understanding failure modes.

Build an Error Analysis Spreadsheet

One of the simplest and most effective tools is Excel or Google Sheets.

Example:

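| Query                   | Search Terms | Search Results | Source Selection | Final Output |
|-------------------------|--------------|----------------|------------------|--------------|
| Black Holes             | Good         | Bad            | Good             | Bad          |
| Seattle Housing         | Good         | Bad            | Good             | Bad          |
| Fruit Harvesting Robots | Bad          | Bad            | Bad              | Bad          |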
Now count failures.

Example Statistics

| Component        | Error Rate |
|------------------|------------|
| Search Terms     | 5%         |
| Search Results   | 45%        |
| Source Selection | 10%        |
| Summarization    | 8%         |

This immediately tells us:

Search Results are the largest source of failure.

Quantifying Error Rates

If:

  • 100 traces analyzed
  • 45 traces contain poor search results

Then:

$$ErrorRate_{search} = \frac{45}{100}$$

This provides objective evidence.

Instead of:

"I think search is the issue."

you can say:

"Search contributes to 45% of failures."
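The counting step is easy to automate once the spreadsheet rows are exported. The sketch below assumes each analyzed trace is a dictionary mapping a component name to "good" or "bad"; the field names and `component_error_rates` are hypothetical.

```python
from collections import Counter

COMPONENTS = ["search_terms", "search_results", "source_selection"]


def component_error_rates(rows, components=COMPONENTS):
    """Return the fraction of analyzed traces in which each component was bad."""
    bad = Counter()
    for row in rows:
        for component in components:
            if row.get(component) == "bad":
                bad[component] += 1
    total = len(rows)
    return {c: (bad[c] / total if total else 0.0) for c in components}


# The three spreadsheet rows from the example above.
rows = [
    {"query": "Black Holes", "search_terms": "good",
     "search_results": "bad", "source_selection": "good"},
    {"query": "Seattle Housing", "search_terms": "good",
     "search_results": "bad", "source_selection": "good"},
    {"query": "Fruit Harvesting Robots", "search_terms": "bad",
     "search_results": "bad", "source_selection": "bad"},
]
rates = component_error_rates(rows)
```

On these three rows, search results are bad in every trace while the other two components fail in one trace out of three. Three rows are far too few to act on, so the same function would normally be run over a larger sample, such as the 100 traces mentioned above.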

Prioritization Matrix

Not every problem deserves immediate attention.

A useful framework is:

| Component      | Error Rate | Easy to Fix? |
|----------------|------------|--------------|
| Search Terms   | 5%         | Yes          |
| Search Results | 45%        | Yes          |
| Summarization  | 8%         | No           |

Prioritize:

High Error Rate
+
Easy Improvement

This often delivers the largest gains.
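As a rough illustration, the matrix can be turned into an ordered backlog. The ranking rule used here (easy fixes first, then highest error rate) is one possible heuristic and an assumption of this sketch, not a rule from the matrix itself; `prioritize` and the field names are hypothetical.

```python
def prioritize(components):
    """Order components: easy fixes first, then by error rate, highest first."""
    return sorted(
        components,
        key=lambda c: (c["easy_to_fix"], c["error_rate"]),
        reverse=True,
    )


backlog = prioritize([
    {"component": "Search Terms", "error_rate": 0.05, "easy_to_fix": True},
    {"component": "Search Results", "error_rate": 0.45, "easy_to_fix": True},
    {"component": "Summarization", "error_rate": 0.08, "easy_to_fix": False},
])
order = [item["component"] for item in backlog]
```

With the numbers from the table, Search Results comes out on top because it is both the largest source of failure and easy to fix.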

Example Improvements

After identifying weak search results, possible fixes include:

1. Better Search Provider

Replace:

Search Engine A

with:

Search Engine B

2. Improved Ranking

Add:

rerank_results()

before source selection.

3. Domain Filtering

Restrict searches to:

nature.com
science.org
arxiv.org
nasa.gov

These targeted improvements are only possible because error analysis revealed the bottleneck.
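Domain filtering, for example, can be a few lines applied to the search results before source selection. This sketch assumes results arrive as plain URL strings; `is_allowed_source` is a hypothetical helper and the allow-list simply mirrors the domains listed above.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"nature.com", "science.org", "arxiv.org", "nasa.gov"}


def is_allowed_source(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def filter_results(urls, allowed=ALLOWED_DOMAINS):
    """Keep only results from trusted domains, preserving their order."""
    return [u for u in urls if is_allowed_source(u, allowed)]
```

Matching on the full host or a dot-prefixed suffix keeps www.nature.com while rejecting look-alike hosts such as notnature.com.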

Error Analysis vs Prompt Engineering

Many beginners immediately modify prompts.

But consider:

Bad Search Results

Will a better summarization prompt help?

Probably not.

The summarizer can only work with the information it receives.

Error analysis prevents optimization in the wrong place.


Final Thoughts

Agentic systems contain many components:

  • planners
  • retrievers
  • search engines
  • evaluators
  • memory systems
  • tool callers
  • generators

When performance is poor, almost any component could be responsible.

Error analysis provides a systematic way to identify:

  1. Which component is failing
  2. How often it fails
  3. Whether it is worth fixing

Without error analysis:

$$Optimization = Guessing$$

With error analysis:

$$Optimization = Evidence\ Based\ Engineering$$

And that distinction often determines whether an AI team improves a system in days or spends months optimizing the wrong thing.

Handling Failures

Retries

Retries handle the common case of transient failures in distributed systems.

Circuit breakers

Circuit breakers prevent the retry pattern from amplifying sustained outages into cascading failures across the system.

Fallback

Fallback chains ensure the system degrades to reduced but functional capability rather than complete unavailability.

Together they form a defence-in-depth strategy: retries for noise, circuit breakers for sustained outages, fallbacks for the cases where recovery isn't possible within the task's time budget.

All three should be present in any production agent system that calls external services.
