
Error Analysis for Agentic AI

Learn how to systematically diagnose, measure, and fix failures in agentic AI systems using error analysis. Discover how traces, component-level evaluations, root cause analysis, and observability help identify bottlenecks and drive continuous improvement in AI agent performance.

Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026


How Top Teams Decide What to Fix Next

One of the biggest challenges in building agentic systems is not creating the first version.

It is improving it.

Almost every AI engineer has experienced this:

graph TD
    A[Build Workflow 🤖]
    --> B[Test Workflow]

    B --> C[Disappointing Results 👎🏻]

    C --> D[Now What?]

The problem is that agentic workflows contain many moving parts.

Without a systematic process, teams often spend weeks optimizing the wrong component.

This is where Error Analysis becomes one of the most valuable skills in Agentic AI engineering.


Why Guessing Fails 🤔

Many teams optimize based on intuition:

"This feels like a prompt issue."

"This feels like a retrieval issue."

"This feels like the model is weak."

Sometimes they are right.

Often they are not.

The danger is spending months improving a component that contributes very little to overall performance.

The Engineering Mindset 🎯

Strong AI teams think like performance engineers.

Instead of asking:

What can we improve?

They ask:

What should we improve?

Those are very different questions.

The first creates endless work.

The second creates measurable progress.

Traces: The X-Ray of Agentic Systems

To diagnose failures, we need visibility into intermediate outputs.

Taken together, the intermediate outputs of a run are called a:

Trace 🧾

A trace contains every step the agent executed during a run.

Inspecting traces is one of the most valuable debugging techniques.

A trace typically records:

  • prompts
  • intermediate reasoning
  • tool calls
  • retrieval outputs
  • state transitions
  • memory updates

Example:

{
  "query": "Recent developments in black hole science",

  "search_terms": [
    "event horizon telescope discoveries",
    "black hole imaging research"
  ],

  "search_results": [
    "https://astrokidnews.com/...",
    "https://spacefunblog.com/..."
  ],

  "selected_sources": [
    "...",
    "..."
  ],

  "summary": "..."
}

Instead of examining only the final answer, we inspect every intermediate step.

Span vs Trace

Two terms frequently appear in observability systems.

Span

A single step.

Example:

  • Generate Search Terms
  • Fetch Web Results

Trace

The complete execution path.

Example: the search workflow

graph LR
    A[Search Terms]
    --> B[Search Results]
    --> C[Selected Sources]
    --> D[Summary]

A trace is simply a collection of spans.
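As a rough illustration of how spans add up to a trace, here is a minimal Python sketch. The `Span` and `Trace` classes and the `record` method are hypothetical names made up for this example, not the API of any observability library. Real tracing tools usually capture timing and nesting as well.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One step of the workflow, for example generating search terms."""
    name: str
    inputs: Any
    output: Any


@dataclass
class Trace:
    """The complete execution path for one query: an ordered list of spans."""
    query: str
    spans: list[Span] = field(default_factory=list)

    def record(self, name: str, inputs: Any, output: Any) -> Any:
        # Store the intermediate output, then return it so the workflow
        # can pass it straight on to the next step.
        self.spans.append(Span(name, inputs, output))
        return output


# Hypothetical run of the search workflow; the values are placeholders.
trace = Trace(query="Recent developments in black hole science")
terms = trace.record(
    "search_terms",
    trace.query,
    ["event horizon telescope discoveries", "black hole imaging research"],
)
results = trace.record(
    "search_results",
    terms,
    ["https://astrokidnews.com/...", "https://spacefunblog.com/..."],
)

# Inspect every intermediate step, not only the final answer.
for span in trace.spans:
    print(span.name, "->", span.output)
```

Because `record` returns the output it stores, it can wrap existing steps without changing how data flows between them.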


The Error Analysis Flywheel

The best teams repeatedly execute:

graph TD
    A[Build 🤖]
    --> B[Observe 👀]

    B --> C[Trace Analysis 🔎]

    C --> D[Identify Bottleneck ⚠️]

    D --> E[Improve Component 🔨]

    E --> F[Measure Again 📋]

    F --> A

Each iteration makes the system incrementally better.

Error Analysis Workflow

A structured approach looks like this:

graph TD
    A[Bad Final Output 👎🏻]
    --> B[Inspect Trace]

    B --> C[Identify Weak Component ⚠️]

    C --> D[Count Frequency 📋]

    D --> E[Prioritize Fixes]

    E --> F[Improve System 📈]

This turns debugging into a data-driven process.


Practical Example

Suppose we ask a research agent:

Write a report on recent developments in black hole science.

The generated report misses several important discoveries.

At first glance:

Output Quality = Poor

But that tells us nothing about why.

The root cause could exist anywhere in the workflow.


graph TD
    A[User Query] --> B[Generate Search Terms] 
    B --> C[Web Search] 
    C --> D[Select Best Sources] 
    D --> E[Fetch Documents] 
    E --> F[Summarize] 
    F --> G[Generate Report]

Suppose our agent generated:

1. Search Terms

Black hole theories Einstein
Event Horizon Telescope radio

Question:

Would a human expert use these search terms?

If yes:

Search Term Generation = Good

Move on.

2. Search Results

Returned URLs:

AstroKidNews
SpaceFunBlog
SpaceBot2000

Question:

Would a human researcher use these sources?

Probably not.

A human would likely prefer:

  • Nature
  • Science
  • arXiv
  • NASA
  • ESA

Now we have a clue.

Search Results = Weak

The problem may not be the search terms.

The problem may be the search engine or ranking strategy.

Focus Only on Failures

One subtle but important recommendation:

Do not waste time analyzing successful runs.

Suppose:

| Run | Result |
|-----|--------|
| 1   | Good   |
| 2   | Good   |
| 3   | Poor   |
| 4   | Good   |
| 5   | Poor   |

Focus on:

Run 3
Run 5

These contain the information you need.

This is why it is called:

Error Analysis

The goal is understanding failure modes.
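In code, selecting the runs worth inspecting is just a filter. A minimal sketch using the five run results listed above; the `runs` dictionary is an assumed in-memory stand-in for wherever your run results actually live:

```python
# Run results from the example: run id -> judgment of the final output.
runs = {1: "Good", 2: "Good", 3: "Poor", 4: "Good", 5: "Poor"}

# Keep only the runs whose final output was judged poor.
failed_runs = [run_id for run_id, result in runs.items() if result == "Poor"]

print(failed_runs)  # [3, 5]
```

Only the traces for these runs go into the error analysis.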

Build an Error Analysis Spreadsheet

One of the simplest and most effective tools is a spreadsheet in Excel or Google Sheets.

Example:

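| Query                   | Search Terms | Search Results | Source Selection | Final Output |
|-------------------------|--------------|----------------|------------------|--------------|
| Black Holes             | Good         | Bad            | Good             | Bad          |
| Seattle Housing         | Good         | Bad            | Good             | Bad          |
| Fruit Harvesting Robots | Bad          | Bad            | Bad              | Bad          |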

Now count failures.

Example Statistics

| Component        | Error Rate |
|------------------|------------|
| Search Terms     | 5%         |
| Search Results   | 45%        |
| Source Selection | 10%        |
| Summarization    | 8%         |

This immediately tells us:

Search Results are the largest source of failure.
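The counting step itself is easy to automate once the labels exist. The sketch below computes per-component error rates from the three spreadsheet rows shown earlier. The `error_rates` function is a hypothetical helper written for this example. With only three rows, its numbers will not match the illustrative statistics above, which stand in for a much larger sample.

```python
from collections import Counter

# Rows copied from the example spreadsheet. "Final Output" is left out
# because it is the outcome being explained, not a component.
rows = [
    {"Query": "Black Holes", "Search Terms": "Good",
     "Search Results": "Bad", "Source Selection": "Good"},
    {"Query": "Seattle Housing", "Search Terms": "Good",
     "Search Results": "Bad", "Source Selection": "Good"},
    {"Query": "Fruit Harvesting Robots", "Search Terms": "Bad",
     "Search Results": "Bad", "Source Selection": "Bad"},
]


def error_rates(rows: list[dict[str, str]]) -> dict[str, float]:
    """Return, per component, the share of rows labeled "Bad"."""
    if not rows:
        return {}
    components = [column for column in rows[0] if column != "Query"]
    bad_counts = Counter(
        component
        for row in rows
        for component in components
        if row[component] == "Bad"
    )
    return {c: bad_counts[c] / len(rows) for c in components}


rates = error_rates(rows)

# Print components from most to least error-prone.
for component, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{component}: {rate:.0%}")
```

Even on this tiny sample, Search Results comes out on top, which matches the conclusion drawn from the larger table.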

Quantifying Error Rates

If:

  • 100 traces analyzed
  • 45 traces contain poor search results

Then:

$$ErrorRate_{search} = \frac{45}{100} = 45\%$$

This provides objective evidence.
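The same calculation generalizes to any component. As a sketch, assume $N$ is the number of traces analyzed and $n_{c}$ is the number of those traces in which component $c$ produced a poor output (both symbols are introduced here for illustration):

```latex
\[
ErrorRate_{c} = \frac{n_{c}}{N}
\]
```

These rates do not have to sum to 100%, because a single trace can contain more than one weak component, as in the Fruit Harvesting Robots row of the spreadsheet.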

Instead of:

"I think search is the issue."

you can say:

"Poor search results appear in 45% of the traces we analyzed."

Prioritization Matrix

Not every problem deserves immediate attention.

A useful framework is:

| Component      | Error Rate | Easy to Fix? |
|----------------|------------|--------------|
| Search Terms   | 5%         | Yes          |
| Search Results | 45%        | Yes          |
| Summarization  | 8%         | No           |

Prioritize:

High Error Rate
+
Easy Improvement

This often delivers the largest gains.
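One way to turn this matrix into a ranked list is a simple score. The sketch below is an assumption, not a standard formula: it multiplies the error rate by a weight, and `HARD_FIX_DISCOUNT = 0.5` is an arbitrary choice meaning a hard fix counts half as much. The `Component` class and `prioritize` function are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    error_rate: float  # share of analyzed traces where this component failed
    easy_to_fix: bool


# Values from the prioritization table.
components = [
    Component("Search Terms", 0.05, True),
    Component("Search Results", 0.45, True),
    Component("Summarization", 0.08, False),
]

# Assumption: a hard fix is worth half as much per point of error rate.
HARD_FIX_DISCOUNT = 0.5


def priority(component: Component) -> float:
    """Score a component: high error rate and an easy fix both raise it."""
    weight = 1.0 if component.easy_to_fix else HARD_FIX_DISCOUNT
    return component.error_rate * weight


def prioritize(components: list[Component]) -> list[Component]:
    """Return components ordered from highest to lowest priority."""
    return sorted(components, key=priority, reverse=True)


for c in prioritize(components):
    print(f"{c.name}: priority {priority(c):.3f}")
```

With these numbers Search Results ranks first by a wide margin. The discount only affects the order of the smaller items, so tune it to how your team actually weighs effort.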

Example Improvements

After identifying weak search results, possible fixes include:

1. Better Search Provider

Replace:

Search Engine A

with:

Search Engine B

2. Improved Ranking

Add:

rerank_results()

before source selection.

3. Domain Filtering

Restrict searches to:

nature.com
science.org
arxiv.org
nasa.gov

These targeted improvements are only possible because error analysis revealed the bottleneck.
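Fixes 2 and 3 can be sketched in a few lines of Python. The article only names `rerank_results()`; the implementation here is an assumed, deliberately simple one that promotes URLs from the listed domains. The `is_preferred` and `filter_results` helpers and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Domains listed above; treat this as a starting point, not a standard.
PREFERRED_DOMAINS = ("nature.com", "science.org", "arxiv.org", "nasa.gov")


def is_preferred(url: str) -> bool:
    """True if the URL's host is a preferred domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in PREFERRED_DOMAINS)


def rerank_results(urls: list[str]) -> list[str]:
    """Move preferred-domain URLs to the front of the list."""
    # sorted() is stable, so within each group the search engine's
    # original ranking is preserved.
    return sorted(urls, key=lambda url: not is_preferred(url))


def filter_results(urls: list[str]) -> list[str]:
    """Stricter alternative: drop everything outside the preferred domains."""
    return [url for url in urls if is_preferred(url)]


# Hypothetical search results for illustration.
urls = [
    "https://astrokidnews.com/a",
    "https://arxiv.org/abs/x",
    "https://spacefunblog.com/b",
    "https://www.nasa.gov/news",
]

print(rerank_results(urls))
print(filter_results(urls))
```

Reranking keeps weaker sources as a fallback, while filtering removes them entirely. Which one fits depends on how often the preferred domains return enough results on their own.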

Error Analysis vs Prompt Engineering

Many beginners immediately modify prompts.

But consider:

Bad Search Results

Will a better summarization prompt help?

Probably not.

The summarizer can only work with the information it receives.

Error analysis prevents optimization in the wrong place.


Final Thoughts

Agentic systems contain many components:

  • planners
  • retrievers
  • search engines
  • evaluators
  • memory systems
  • tool callers
  • generators

When performance is poor, almost any component could be responsible.

Error analysis provides a systematic way to identify:

  1. Which component is failing
  2. How often it fails
  3. Whether it is worth fixing

Without error analysis:

$$Optimization = Guessing$$

With error analysis:

$$Optimization = Evidence\ Based\ Engineering$$

And that distinction often determines whether an AI team improves a system in days or spends months optimizing the wrong thing.
