Error Analysis in Agentic AI

Learn how Error Analysis helps diagnose failures in Agentic AI systems by identifying bottlenecks, inspecting traces, and measuring component-level performance. Discover practical techniques for root cause analysis, observability, and continuous improvement of AI agents in production.

Building Effective Evals for Agentic AI Systems

Most AI engineers focus on:

prompts
models
tools
frameworks

But after building several agentic systems, you quickly realize something surprising:

The quality of your evals often matters more than the quality of your prompts.

In traditional software engineering, correctness is usually deterministic.

In Agentic AI, correctness is often probabilistic.

This changes everything.

The most successful teams are not necessarily the ones with the biggest models.

They are often the ones with the most disciplined evaluation processes.

The Biggest Mistake Teams Make

When building an agentic workflow, many developers spend weeks discussing architecture before implementing anything.

A more effective approach is:

graph TD
    A[Build Prototype]
    --> B[Test Real Inputs]
    --> C[Find Failures]
    --> D[Create Evals]
    --> E[Improve System]

The reality is:

You usually do not know what will fail until you see the system running.

This is one of the core differences between traditional software engineering and Agentic AI development.

Why Small Evals Are Fine

Many teams delay evaluation because they believe:

We need thousands of examples.

Usually not.

A practical starting point:

10–20 examples

is often enough to identify:

major failure modes
regression risks
improvement opportunities

You can always expand later.

Evals Evolve With The System

As the workflow improves:

graph TD
    A[Version 1]
    --> B[New Failures Found]
    --> C[Update Eval Set]
    --> D[Version 2]

Your evaluation suite should evolve alongside the agent.

Just like production code.
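One way to picture an eval set that grows with the agent is a plain list of cases that gets a new entry whenever a failure is found, plus a check that a new version does not break cases the old one passed. The sketch below is illustrative only: `agent_v1`, `agent_v2`, and the case format are hypothetical stand-ins, not part of any framework.

```python
def failed_cases(agent, eval_set):
    """Return the ids of eval cases the agent gets wrong."""
    return {case["id"] for case in eval_set if agent(case["input"]) != case["expected"]}

def regressions(old_agent, new_agent, eval_set):
    """Return ids of cases the old version passed but the new version fails."""
    return failed_cases(new_agent, eval_set) - failed_cases(old_agent, eval_set)

# Toy "agents": plain functions standing in for real agent versions.
def agent_v1(text):
    return text.upper()

def agent_v2(text):
    return text.strip().upper()

eval_set = [
    {"id": "case-1", "input": "acme corp", "expected": "ACME CORP"},
    {"id": "case-2", "input": "globex", "expected": "GLOBEX"},
]

# A new failure found in Version 1 (stray whitespace) becomes a new eval case.
eval_set.append({"id": "case-3", "input": " initech ", "expected": "INITECH"})

v1_failures = failed_cases(agent_v1, eval_set)
v2_regressions = regressions(agent_v1, agent_v2, eval_set)
print("v1 failures:", v1_failures)
print("v2 regressions:", v2_regressions)
```

Keeping the eval set in version control next to the agent code makes it easy to see which failure motivated each case.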

The Evaluation Flywheel

The most effective teams continuously iterate:

graph TD
    A[Build]
    --> B[Observe]
    --> C[Analyze]
    --> D[Create Eval]
    --> E[Improve]
    --> F[Measure]
    --> A

This cycle repeats indefinitely.

Every improvement produces new insights.

Every failure becomes training data.

Start With a Quick and Dirty Prototype

Suppose you are building an invoice processing agent.

The workflow:

graph TD
    A[Invoice PDF]
    --> B[OCR]
    --> C[LLM Extraction]
    --> D[Database]

The agent extracts:

vendor name
address
amount
due date

The temptation is to immediately design a sophisticated evaluation framework.

Don't.

Instead:

Build a prototype
Run it on 10–20 invoices
Inspect outputs manually
Identify failure patterns

Error Analysis Comes Before Evals

Error analysis is the process of manually reviewing outputs to identify common failure modes.

Error Analysis

graph TD
    A[Outputs]
    --> B[Manual Review]
    --> C[Identify Failure Pattern]
    --> D[Create Eval]

Error Analysis Workflow

A structured approach looks like this:



graph TD
    A[Bad Final Output] --> B[Inspect Trace]
    B --> C[Identify Weak Component]
    C --> D[Count Frequency]
    D --> E[Prioritize Fixes]
    E --> F[Improve System]

This turns debugging into a data-driven process.

Without error analysis, you risk measuring the wrong thing.
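The "count frequency" and "prioritize fixes" steps can be as simple as a tally. The following is a minimal sketch; the review records and component names (`ocr`, `llm_extraction`) are hypothetical and would come from your own manual trace inspection.

```python
from collections import Counter

# Hypothetical review notes: one entry per bad final output, naming the
# component a human reviewer blamed after inspecting the trace.
reviewed_failures = [
    {"invoice": "inv-001", "weak_component": "ocr"},
    {"invoice": "inv-004", "weak_component": "llm_extraction"},
    {"invoice": "inv-007", "weak_component": "llm_extraction"},
    {"invoice": "inv-009", "weak_component": "llm_extraction"},
    {"invoice": "inv-012", "weak_component": "ocr"},
]

def prioritize(failures):
    """Count failures per component and return them most frequent first."""
    counts = Counter(f["weak_component"] for f in failures)
    return counts.most_common()

ranking = prioritize(reviewed_failures)
for component, count in ranking:
    print(f"{component}: {count} of {len(reviewed_failures)} failures")
```

In this made-up sample, the extraction step accounts for most failures, so it would be the first candidate for a dedicated eval.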

The Four Types of Evals

A useful framework is a 2×2 matrix.

quadrantChart
    title Evaluation Framework
    x-axis No Ground Truth --> Ground Truth
    y-axis Objective --> Subjective

    quadrant-1 Subjective + Ground Truth
    quadrant-2 Subjective + No Ground Truth
    quadrant-3 Objective + No Ground Truth
    quadrant-4 Objective + Ground Truth

1. Objective + Ground Truth

Failures measurable with deterministic rules and per-example labels.

Metric:

Accuracy = \frac{Correct\ Predictions}{Total\ Examples}

Structured outputs make objective evals significantly easier.

Python implementation:


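def evaluate(actual, predicted):
    # Exact match between the label and the agent's output for one example.
    return actual == predicted

# Overall accuracy over a list of (actual, predicted) pairs:
def accuracy(examples):
    correct_predictions = sum(evaluate(a, p) for a, p in examples)
    return correct_predictions / len(examples)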

Examples:

Invoice extraction
Classification
Named entity extraction
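For the invoice agent described earlier, exact-match accuracy is often more useful per field than overall, since it shows which field is failing. Here is a minimal sketch under assumed field names and made-up sample data; a real setup would likely normalize values (dates, currency formats) before comparing.

```python
# Field names are assumptions based on the four fields the agent extracts.
FIELDS = ["vendor_name", "address", "amount", "due_date"]

def field_accuracy(labels, predictions):
    """Return per-field exact-match accuracy across all examples."""
    scores = {}
    for field in FIELDS:
        correct = sum(
            1 for gold, pred in zip(labels, predictions)
            if gold[field] == pred.get(field)
        )
        scores[field] = correct / len(labels)
    return scores

# Made-up ground-truth labels and agent outputs for two invoices.
labels = [
    {"vendor_name": "Acme Corp", "address": "1 Main St", "amount": "120.00", "due_date": "2026-01-15"},
    {"vendor_name": "Globex", "address": "9 Oak Ave", "amount": "75.50", "due_date": "2026-02-01"},
]
predictions = [
    {"vendor_name": "Acme Corp", "address": "1 Main St", "amount": "120.00", "due_date": "2026-01-15"},
    {"vendor_name": "Globex", "address": "9 Oak Ave", "amount": "75.50", "due_date": "2026-01-02"},
]

scores = field_accuracy(labels, predictions)
print(scores)
```

In this sample only `due_date` falls below 100%, which points at date handling rather than at the extraction step as a whole.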

2. Objective + No Ground Truth

Failures measurable with deterministic rules but no per-example labels.

Metric:

rule_compliance

Examples:

Word count limits
JSON validity
Formatting rules

Consider a marketing copy agent with this requirement:

Instagram captions must be 10 words or fewer.

Example output:

Stylish sunglasses built for every summer adventure

Word count = 7

Pass.

Evaluation Logic

def word_count(text):
    return len(text.split())

Metric:

count <= 10

No custom label is needed.

Every example shares the same rule.

Compliance Calculation

Compliance = \frac{Captions\ With\ \leq 10\ Words}{Total\ Captions}

This is objective evaluation without per-example ground truth.
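Because every example shares the same rule, a rule-based eval reduces to a predicate plus a pass rate. The sketch below applies that idea to the word-count rule and, as a second illustration, to JSON validity using the standard `json` module. The helper names and the second caption are invented for the example.

```python
import json

MAX_WORDS = 10

def within_word_limit(text, limit=MAX_WORDS):
    """True if the text has at most `limit` whitespace-separated words."""
    return len(text.split()) <= limit

def is_valid_json(text):
    """True if the string parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def compliance_rate(outputs, rule):
    """Fraction of outputs that satisfy the rule."""
    return sum(1 for output in outputs if rule(output)) / len(outputs)

captions = [
    "Stylish sunglasses built for every summer adventure",
    "Meet the lightweight frames that keep up with you from sunrise hikes to rooftop sunsets",
]

caption_compliance = compliance_rate(captions, within_word_limit)
print(f"Caption compliance: {caption_compliance:.0%}")
```

The same `compliance_rate` helper works for any rule that takes one output and returns a boolean, for example `compliance_rate(responses, is_valid_json)`.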

3. Subjective + Ground Truth

Failures that deterministic rules cannot easily measure, but that do have per-example labels.

Metric:

LLM Judge + Gold Labels

Examples:

Research quality
Fact coverage
Content completeness

Suppose the task is:

Write a report about black hole research.

You review outputs and discover:

Important discoveries are frequently omitted.

The challenge:

Different reports may discuss the same topic using different wording.

Simple regex matching becomes unreliable.

1. Gold Standard Talking Points

For each topic:

{
  "topic": "Black Holes",
  "important_points": [
    "Event Horizon",
    "Hawking Radiation",
    "Gravitational Waves",
    "Black Hole Imaging",
    "Accretion Disk Physics"
  ]
}

Now we have domain-specific ground truth.

2. LLM-as-a-Judge

Instead of writing rules:

if "Event Horizon" in report:

we ask another LLM:

Determine how many gold standard points
are present in this report.

Architecture:

graph TD
    A[Research Agent]
    --> B[Generated Report]

    B --> C[Judge LLM]

    C --> D[Coverage Score]

This handles paraphrasing much better than keyword matching.
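A coverage score built on the gold standard talking points might look like the sketch below. The judge is passed in as a plain callable, `ask_judge`, that takes a prompt and returns text; wiring it to an actual model is left out. The prompt wording, the function names, and the canned judge used to make the example runnable are all assumptions for illustration.

```python
def build_judge_prompt(report, point):
    """Ask about one talking point at a time to keep answers easy to parse."""
    return (
        "You are grading a research report.\n"
        f"Talking point: {point}\n"
        f"Report:\n{report}\n"
        "Does the report cover this talking point, even if paraphrased? "
        "Answer YES or NO."
    )

def coverage_score(report, important_points, ask_judge):
    """Return (fraction of points covered, list of covered points)."""
    covered = [
        point for point in important_points
        if ask_judge(build_judge_prompt(report, point)).strip().upper().startswith("YES")
    ]
    return len(covered) / len(important_points), covered

def make_canned_judge(answers):
    """Stand-in for a real LLM call: returns a fixed answer per talking point."""
    def judge(prompt):
        for point, answer in answers.items():
            if f"Talking point: {point}\n" in prompt:
                return answer
        return "NO"
    return judge

important_points = [
    "Event Horizon",
    "Hawking Radiation",
    "Gravitational Waves",
    "Black Hole Imaging",
    "Accretion Disk Physics",
]
judge = make_canned_judge({"Event Horizon": "YES", "Hawking Radiation": "yes"})

score, covered = coverage_score("...report text...", important_points, judge)
print(f"Coverage: {score:.0%}, covered: {covered}")
```

Asking one yes/no question per point, rather than asking for a single count, tends to make the judge's output easier to parse and to audit.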

4. Subjective + No Ground Truth

Failures that deterministic rules cannot easily measure, with no per-example labels.

Examples:

Chart aesthetics
Writing style
Clarity
Creativity

Metric:

LLM Judge + Rubric

This is often the hardest category.
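With no labels to compare against, the rubric itself carries the standard. One possible shape, sketched under assumptions: each criterion is a question, the judge replies with a number from 1 to 5, and the scores are averaged into a value between 0 and 1. The rubric text, the scale, and the canned judge are illustrative, not a recommended rubric.

```python
# Hypothetical rubric: criterion name -> question put to the judge.
RUBRIC = {
    "clarity": "Is the main idea easy to follow?",
    "style": "Does the tone fit the intended audience?",
    "creativity": "Does the text avoid generic phrasing?",
}

def rubric_score(text, rubric, ask_judge, scale_max=5):
    """Return (overall score in [0, 1], per-criterion scores)."""
    per_criterion = {}
    for name, question in rubric.items():
        prompt = (
            f"{question}\n"
            f"Rate from 1 to {scale_max}. Reply with the number only.\n\n"
            f"Text:\n{text}"
        )
        raw = int(ask_judge(prompt).strip())
        # Clamp in case the judge replies outside the requested range.
        per_criterion[name] = min(max(raw, 1), scale_max)
    overall = sum(per_criterion.values()) / (len(per_criterion) * scale_max)
    return overall, per_criterion

def canned_judge(prompt):
    """Stand-in for a real LLM call, with fixed ratings per question."""
    if prompt.startswith("Is the main idea"):
        return "4"
    if prompt.startswith("Does the tone"):
        return "5"
    return " 3\n"

overall, details = rubric_score("Sample marketing copy.", RUBRIC, canned_judge)
print(overall, details)
```

Reporting the per-criterion scores alongside the overall number makes it easier to spot-check the judge against a few human ratings.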

Human-Level Performance as a Benchmark

A useful question:

Where is the agent still worse than an expert human?

Those gaps often reveal the highest-value improvements.

For example:

Task	Human	Agent
Date Extraction	99%	90%
Research Coverage	95%	72%
Policy Compliance	98%	91%

These become your roadmap.
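Turning the table into a roadmap is simple arithmetic: subtract the agent's score from the human's score and sort by the gap. A small sketch using the numbers above:

```python
# Scores in percent, copied from the table above.
benchmarks = {
    "Date Extraction": {"human": 99, "agent": 90},
    "Research Coverage": {"human": 95, "agent": 72},
    "Policy Compliance": {"human": 98, "agent": 91},
}

def rank_gaps(scores):
    """Return (task, human minus agent) pairs, largest gap first."""
    gaps = {task: s["human"] - s["agent"] for task, s in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

roadmap = rank_gaps(benchmarks)
for task, gap in roadmap:
    print(f"{task}: {gap} points behind human")
```

Here Research Coverage has the largest gap at 23 points, followed by Date Extraction at 9 and Policy Compliance at 7.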

Final Thoughts

The biggest misconception in Agentic AI is that success comes from better prompts.

In practice:

Production\ Success \neq Prompt\ Quality

A more accurate equation is:

Production\ Success = Quality(Workflow) + Quality(Evals) + Iteration\ Speed

The best agentic systems are not built through perfect planning.

They are built through disciplined experimentation, rigorous evaluation, and continuous improvement.

And that is why evaluation is one of the most important skills in modern AI engineering.

Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026
