Error Analysis in Agentic AI

Learn how Error Analysis helps diagnose failures in Agentic AI systems by identifying bottlenecks, inspecting traces, and measuring component-level performance. Discover practical techniques for root cause analysis, observability, and continuous improvement of AI agents in production.

Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026

Building Effective Evals for Agentic AI Systems

Most AI engineers focus on:

  • prompts
  • models
  • tools
  • frameworks

But after building several agentic systems, you quickly realize something surprising:

The quality of your evals often matters more than the quality of your prompts.

In traditional software engineering, correctness is usually deterministic.

In Agentic AI, correctness is often probabilistic.

This changes everything.

The most successful teams are not necessarily the ones with the biggest models.

They are often the ones with the most disciplined evaluation processes.

The Biggest Mistake Teams Make

When building an agentic workflow, many developers spend weeks discussing architecture before implementing anything.

A more effective approach is:

graph TD
    A[Build Prototype]
    --> B[Test Real Inputs]
    --> C[Find Failures]
    --> D[Create Evals]
    --> E[Improve System]

The reality is:

You usually do not know what will fail until you see the system running.

This is one of the core differences between traditional software engineering and Agentic AI development.

Why Small Evals Are Fine

Many teams delay evaluation because they believe:

We need thousands of examples.

Usually not.

A practical starting point:

10–20 examples

is often enough to identify:

  • major failure modes
  • regression risks
  • improvement opportunities

You can always expand later.
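
As a minimal sketch of how small such a starting set can be, the snippet below keeps a handful of cases as plain Python data and reports which ones fail. The `EvalCase` structure, the `run_evals` helper, and the `toy_agent` function are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    input_text: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent and collect the names of failures."""
    failures = [c.name for c in cases if agent(c.input_text) != c.expected]
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


# Hypothetical stand-in for a real agent: it upper-cases its input.
def toy_agent(text: str) -> str:
    return text.upper()


cases = [
    EvalCase("simple", "abc", "ABC"),
    EvalCase("already_upper", "XYZ", "XYZ"),
    EvalCase("deliberate_failure", "abc", "abc"),
]

report = run_evals(toy_agent, cases)
print(report)
```

Growing the set later only means appending more `EvalCase` entries to the list.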

Evals Evolve With The System

As the workflow improves:

graph TD
    A[Version 1]
    --> B[New Failures Found]
    --> C[Update Eval Set]
    --> D[Version 2]

Your evaluation suite should evolve alongside the agent.

Just like production code.
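
One way to make this concrete is to compare per-case results between two versions of the agent. The sketch below assumes results are stored as a mapping from case name to pass/fail; the case names and the `find_regressions` helper are hypothetical.

```python
def find_regressions(old_results: dict[str, bool], new_results: dict[str, bool]) -> list[str]:
    """Return cases that passed in the old version but do not pass in the new one.

    A case that is missing from the new results is treated as a regression.
    """
    return sorted(
        name
        for name, passed in old_results.items()
        if passed and not new_results.get(name, False)
    )


# Hypothetical results: case name -> did it pass?
v1_results = {"invoice_001": True, "invoice_002": True, "invoice_003": False}

# Version 2 fixes invoice_003 but breaks invoice_002; invoice_004 is a
# newly discovered failure that was added to the eval set.
v2_results = {
    "invoice_001": True,
    "invoice_002": False,
    "invoice_003": True,
    "invoice_004": False,
}

regressions = find_regressions(v1_results, v2_results)
new_cases = sorted(set(v2_results) - set(v1_results))
print("regressions:", regressions)
print("new cases:", new_cases)
```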

The Evaluation Flywheel

The most effective teams continuously iterate:

graph TD
    A[Build]
    --> B[Observe]
    --> C[Analyze]
    --> D[Create Eval]
    --> E[Improve]
    --> F[Measure]
    --> A

This cycle repeats indefinitely.

Every improvement produces new insights.

Every failure becomes a new case in the eval set.

Start With a Quick and Dirty Prototype

Suppose you are building an invoice processing agent.

The workflow:

graph TD
    A[Invoice PDF]
    --> B[OCR]
    --> C[LLM Extraction]
    --> D[Database]

The agent extracts:

  • vendor name
  • address
  • amount
  • due date

The temptation is to immediately design a sophisticated evaluation framework.

Don't.

Instead:

  1. Build a prototype
  2. Run 10-20 invoices
  3. Inspect outputs manually
  4. Identify failure patterns
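
The manual inspection step can be as simple as dumping outputs into a spreadsheet with an empty notes column. The sketch below assumes a hypothetical `extract_invoice` function standing in for the OCR and LLM extraction steps; the field names mirror the list above, and the sample invoices are made up.

```python
import csv
import io

FIELDS = ["vendor_name", "address", "amount", "due_date"]


def extract_invoice(ocr_text: str) -> dict:
    """Hypothetical stand-in for the real extraction step."""
    result = {field: "" for field in FIELDS}
    result["vendor_name"] = ocr_text.split("\n")[0]
    return result


def build_review_sheet(invoices: dict[str, str]) -> str:
    """Return CSV text with one row per invoice and an empty notes column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["invoice_id", *FIELDS, "reviewer_note"])
    writer.writeheader()
    for invoice_id, ocr_text in invoices.items():
        row = {"invoice_id": invoice_id, **extract_invoice(ocr_text), "reviewer_note": ""}
        writer.writerow(row)
    return buffer.getvalue()


sheet = build_review_sheet({"inv-1": "Acme GmbH\nTotal 100", "inv-2": "Globex\nTotal 42"})
print(sheet)
```

The reviewer fills in `reviewer_note` by hand; recurring notes are the failure patterns worth turning into evals.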

Error Analysis Comes Before Evals

Error analysis is the process of manually reviewing outputs to identify common failure modes.

Error Analysis

graph TD
    A[Outputs]
    --> B[Manual Review]
    --> C[Identify Failure Pattern]
    --> D[Create Eval]

Error Analysis Workflow

A structured approach looks like this:



graph TD
    A[Bad Final Output]
    --> B[Inspect Trace]
    --> C[Identify Weak Component]
    --> D[Count Frequency]
    --> E[Prioritize Fixes]
    --> F[Improve System]

This turns debugging into a data-driven process.
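
The counting and prioritizing steps need very little code. The sketch below assumes each reviewed trace has been tagged by hand with the component judged responsible; the trace IDs and component names are made-up examples.

```python
from collections import Counter

# Hypothetical notes from manually inspecting traces of bad outputs:
# each entry names the component judged responsible for the failure.
trace_reviews = [
    {"trace_id": "t1", "weak_component": "ocr"},
    {"trace_id": "t2", "weak_component": "llm_extraction"},
    {"trace_id": "t3", "weak_component": "llm_extraction"},
    {"trace_id": "t4", "weak_component": "database_write"},
    {"trace_id": "t5", "weak_component": "llm_extraction"},
    {"trace_id": "t6", "weak_component": "ocr"},
]

counts = Counter(review["weak_component"] for review in trace_reviews)
total = sum(counts.values())

# most_common() sorts by frequency, which gives a first-pass fix priority.
priorities = [(component, n, n / total) for component, n in counts.most_common()]
for component, n, share in priorities:
    print(f"{component:15} {n:2d} failures ({share:.0%})")
```

Frequency is only one input to prioritization; a rare failure with a high business cost may still deserve attention first.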

Without error analysis, you risk measuring the wrong thing.

The Four Types of Evals

A useful framework is a 2×2 matrix.

quadrantChart
    title Evaluation Framework
    x-axis No Ground Truth --> Ground Truth
    y-axis Objective --> Subjective

    quadrant-1 Subjective + Ground Truth
    quadrant-2 Subjective + No Ground Truth
    quadrant-3 Objective + No Ground Truth
    quadrant-4 Objective + Ground Truth

1. Objective + Ground Truth

Failures measurable with deterministic rules and per-example labels.

Metric:

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Examples}}$$

Structured outputs make objective evals significantly easier.

Python implementation:


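def evaluate(actual, predicted):
    return actual == predicted

# Overall accuracy over a list of (actual, predicted) pairs:
def accuracy(examples):
    correct_predictions = sum(evaluate(a, p) for a, p in examples)
    return correct_predictions / len(examples)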

Examples:

  • Invoice extraction
  • Classification
  • Named entity extraction
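
For structured outputs such as invoice extraction, accuracy is often more useful per field than per document. The sketch below is one possible way to compute it, assuming labels and predictions are dictionaries with the same keys; the sample values are invented.

```python
def field_accuracy(labels: list[dict], predictions: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per field across all labelled examples."""
    fields = labels[0].keys()
    return {
        field: sum(
            label[field] == pred.get(field)
            for label, pred in zip(labels, predictions)
        ) / len(labels)
        for field in fields
    }


# Hypothetical labels and agent outputs for two invoices.
labels = [
    {"vendor_name": "Acme GmbH", "amount": "100.00", "due_date": "2026-06-30"},
    {"vendor_name": "Globex", "amount": "42.50", "due_date": "2026-07-15"},
]
predictions = [
    {"vendor_name": "Acme GmbH", "amount": "100.00", "due_date": "2026-06-30"},
    {"vendor_name": "Globex", "amount": "42.50", "due_date": "2026-07-01"},
]

scores = field_accuracy(labels, predictions)
print(scores)
```

A per-field breakdown shows at a glance which part of the extraction is weakest (here, `due_date`).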

2. Objective + No Ground Truth

Failures measurable with deterministic rules but no per-example labels.

Metric:

rule_compliance

Examples:

  • Word count limits
  • JSON validity
  • Formatting rules

Consider a marketing copy agent with the following requirement:

Instagram captions must be 10 words or fewer.

Example outputs:

Stylish sunglasses built for every summer adventure

Word count = 7

Pass.

Evaluation Logic

def word_count(text):
    return len(text.split())

Metric:

count <= 10

No custom label is needed.

Every example shares the same rule.

Accuracy Calculation

$$\text{Compliance} = \frac{\text{Captions With 10 Words or Fewer}}{\text{Total Captions}}$$

This is objective evaluation without per-example ground truth.
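
Putting the pieces together, a compliance rate is just the share of outputs that pass one shared rule. The sketch below applies the same helper to the word-count rule and to JSON validity; the function names and the second caption are illustrative assumptions.

```python
import json
from typing import Callable


def within_word_limit(text: str, limit: int = 10) -> bool:
    return len(text.split()) <= limit


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def compliance_rate(outputs: list[str], rule: Callable[[str], bool]) -> float:
    """Fraction of outputs that satisfy a single shared rule."""
    return sum(rule(output) for output in outputs) / len(outputs)


captions = [
    "Stylish sunglasses built for every summer adventure",             # 7 words
    "Meet the shades that go wherever the sun takes you this summer",  # 12 words
]
caption_rate = compliance_rate(captions, within_word_limit)
json_rate = compliance_rate(['{"ok": true}', "{not json}"], is_valid_json)
print(caption_rate, json_rate)
```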


3. Subjective + Ground Truth

Failures that deterministic rules cannot easily measure, but for which per-example labels exist.

Metric:

LLM Judge + Gold Labels

Examples:

  • Research quality
  • Fact coverage
  • Content completeness

Suppose the task is:

Write a report about black hole research.

You review outputs and discover:

Important discoveries are frequently omitted.

The challenge:

Different reports may discuss the same topic using different wording.

Simple regex matching becomes unreliable.

1. Gold Standard Talking Points

For each topic:

{
  "topic": "Black Holes",
  "important_points": [
    "Event Horizon",
    "Hawking Radiation",
    "Gravitational Waves",
    "Black Hole Imaging",
    "Accretion Disk Physics"
  ]
}

Now we have domain-specific ground truth.

2. LLM-as-a-Judge

Instead of writing rules:

covered = "Event Horizon" in report  # exact match only; misses paraphrases

we ask another LLM:

Determine how many gold standard points
are present in this report.

Architecture:

graph TD
    A[Research Agent]
    --> B[Generated Report]

    B --> C[Judge LLM]

    C --> D[Coverage Score]

This handles paraphrasing much better than keyword matching.
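
A minimal sketch of this judge step is shown below. It assumes the judge is any callable that takes a prompt string and returns a JSON string; the prompt wording, the `coverage_score` helper, and the `stub_judge` function are hypothetical, and the stub exists only so the example runs without calling a real model.

```python
import json
from typing import Callable

GOLD_POINTS = [
    "Event Horizon",
    "Hawking Radiation",
    "Gravitational Waves",
    "Black Hole Imaging",
    "Accretion Disk Physics",
]


def build_judge_prompt(report: str, gold_points: list[str]) -> str:
    bullet_list = "\n".join(f"- {point}" for point in gold_points)
    return (
        "For each talking point below, decide whether the report covers it, "
        "even if the wording differs.\n"
        'Answer with JSON only: {"covered": [<points that are covered>]}\n\n'
        f"Talking points:\n{bullet_list}\n\nReport:\n{report}"
    )


def coverage_score(report: str, gold_points: list[str], judge: Callable[[str], str]) -> float:
    """Ask the judge which points are covered and return the covered fraction."""
    reply = json.loads(judge(build_judge_prompt(report, gold_points)))
    # Only count points that are really in the gold list, in case the judge
    # invents or renames a point.
    covered = set(reply.get("covered", [])) & set(gold_points)
    return len(covered) / len(gold_points)


# Stub judge so the sketch runs offline; a real one would call an LLM API.
def stub_judge(prompt: str) -> str:
    return json.dumps({"covered": ["Event Horizon", "Hawking Radiation", "Wormholes"]})


score = coverage_score("...report text...", GOLD_POINTS, stub_judge)
print(score)
```

Judge output is itself probabilistic, so it is worth spot-checking a sample of judge decisions by hand before trusting the score.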


4. Subjective + No Ground Truth

Examples:

  • Chart aesthetics
  • Writing style
  • Clarity
  • Creativity

Metric:

LLM Judge + Rubric

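This is often the hardest category: there is neither a deterministic rule nor a reference answer to check against, so the score is only as good as the rubric.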
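
One possible shape for a rubric-based judge is sketched below. The criteria, weights, 1-5 scale, and prompt wording are all assumptions chosen for illustration, and `stub_judge` replaces a real LLM call so the example runs offline.

```python
import json
from typing import Callable

# Hypothetical rubric: criterion -> weight. Weights sum to 1.
RUBRIC = {"clarity": 0.4, "style": 0.3, "creativity": 0.3}


def rubric_score(
    text: str,
    judge: Callable[[str], str],
    rubric: dict[str, float] = RUBRIC,
) -> float:
    """Weighted average of 1-5 judge ratings, rescaled to the range 0-1."""
    prompt = (
        "Rate the text from 1 (poor) to 5 (excellent) on each criterion: "
        f"{', '.join(rubric)}.\n"
        "Answer with JSON only, mapping each criterion to an integer.\n\n"
        f"Text:\n{text}"
    )
    ratings = json.loads(judge(prompt))
    weighted = sum(weight * ratings[name] for name, weight in rubric.items())
    # Because the weights sum to 1, weighted lies in [1, 5].
    return (weighted - 1) / 4


# Stub judge so the sketch runs offline; a real one would call an LLM API.
def stub_judge(prompt: str) -> str:
    return json.dumps({"clarity": 5, "style": 3, "creativity": 1})


score = rubric_score("Stylish sunglasses built for every summer adventure", stub_judge)
print(round(score, 3))
```

Scoring each criterion separately, rather than asking for one overall grade, makes it easier to see which aspect of the output is dragging the score down.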


Human-Level Performance as a Benchmark

A useful question:

Where is the agent still worse than an expert human?

Those gaps often reveal the highest-value improvements.

For example:

| Task | Human | Agent |
|---|---|---|
| Date Extraction | 99% | 90% |
| Research Coverage | 95% | 72% |
| Policy Compliance | 98% | 91% |

These become your roadmap.
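
Turning the table into a ranked list is a one-liner. The sketch below uses the figures from the table above and sorts tasks by how far the agent trails the human baseline; the variable names are illustrative.

```python
# Figures from the table above, in percent: task -> (human, agent).
benchmarks = {
    "Date Extraction": (99, 90),
    "Research Coverage": (95, 72),
    "Policy Compliance": (98, 91),
}

# Largest gap first: these are the candidates for the next improvement.
gaps = sorted(
    ((task, human - agent) for task, (human, agent) in benchmarks.items()),
    key=lambda item: item[1],
    reverse=True,
)
for task, gap in gaps:
    print(f"{task:18} {gap:2d} points behind human level")
```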


Final Thoughts

The biggest misconception in Agentic AI is that success comes from better prompts.

In practice:

$$\text{Production Success} \neq \text{Prompt Quality}$$

A more accurate equation is:

$$\text{Production Success} = \text{Quality(Workflow)} + \text{Quality(Evals)} + \text{Iteration Speed}$$

The best agentic systems are not built through perfect planning.

They are built through disciplined experimentation, rigorous evaluation, and continuous improvement.

And that is why evaluation is one of the most important skills in modern AI engineering.

