Error Analysis in Agentic AI
Learn how Error Analysis helps diagnose failures in Agentic AI systems by identifying bottlenecks, inspecting traces, and measuring component-level performance. Discover practical techniques for root cause analysis, observability, and continuous improvement of AI agents in production.
Building Effective Evals for Agentic AI Systems
Most AI engineers focus on:
- prompts
- models
- tools
- frameworks
But after building several agentic systems, you quickly realize something surprising:
The quality of your evals often matters more than the quality of your prompts.
In traditional software engineering, correctness is usually deterministic.
In Agentic AI, correctness is often probabilistic.
This changes everything.
The most successful teams are not necessarily the ones with the biggest models.
They are often the ones with the most disciplined evaluation processes.
The Biggest Mistake Teams Make
When building an agentic workflow, many developers spend weeks discussing architecture before implementing anything.
A more effective approach is:
graph TD
A[Build Prototype]
--> B[Test Real Inputs]
--> C[Find Failures]
--> D[Create Evals]
--> E[Improve System]
The reality is:
You usually do not know what will fail until you see the system running.
This is one of the core differences between traditional software engineering and Agentic AI development.
Why Small Evals Are Fine
Many teams delay evaluation because they believe:
We need thousands of examples.
Usually not.
A practical starting point:
10–20 examples
is often enough to identify:
- major failure modes
- regression risks
- improvement opportunities
You can always expand later.
Evals Evolve With The System
As the workflow improves:
graph TD
A[Version 1]
--> B[New Failures Found]
--> C[Update Eval Set]
--> D[Version 2]
Your evaluation suite should evolve alongside the agent.
Just like production code.
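One way to make this concrete is to rerun the same eval set on every version and compare the per-example results. The sketch below is a minimal illustration; the example ids and pass/fail values are made up, and `find_regressions` is a hypothetical helper, not part of any library.

```python
def find_regressions(previous, current):
    """Return ids of examples that passed in the previous run but fail now."""
    return sorted(
        example_id
        for example_id, passed in previous.items()
        if passed and not current.get(example_id, False)
    )


# Made-up results: example id -> did the eval pass?
v1_results = {"inv_001": True, "inv_002": False, "inv_003": True}
# Version 2 adds inv_004, a new case taken from a failure found after Version 1.
v2_results = {"inv_001": True, "inv_002": True, "inv_003": False, "inv_004": True}

regressions = find_regressions(v1_results, v2_results)
new_cases = sorted(set(v2_results) - set(v1_results))
print("Regressions:", regressions)
print("New eval cases:", new_cases)
```

Keeping the results keyed by example id means new cases can be added to the set without breaking the comparison against older runs.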
The Evaluation Flywheel
The most effective teams continuously iterate:
graph TD
A[Build]
--> B[Observe]
--> C[Analyze]
--> D[Create Eval]
--> E[Improve]
--> F[Measure]
--> A
This cycle repeats indefinitely.
Every improvement produces new insights.
Every failure becomes a new eval case.
Start With a Quick and Dirty Prototype
Suppose you are building an invoice processing agent.
The workflow:
graph TD
A[Invoice PDF]
--> B[OCR]
--> C[LLM Extraction]
--> D[Database]
The agent extracts:
- vendor name
- address
- amount
- due date
The temptation is to immediately design a sophisticated evaluation framework.
Don't.
Instead:
- Build a prototype
- Run 10-20 invoices
- Inspect outputs manually
- Identify failure patterns
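The steps above can be sketched as a small script that runs the prototype over a handful of invoices and writes the outputs to a sheet for manual review. Everything here is an assumption made for illustration: `extract_invoice` is a placeholder for the real OCR and LLM extraction pipeline, and the file names and the `reviewer_note` column are hypothetical.

```python
import csv
import io


def extract_invoice(path):
    """Placeholder for the real OCR + LLM extraction pipeline (hypothetical)."""
    return {"vendor_name": "", "address": "", "amount": "", "due_date": ""}


def build_review_sheet(paths, extract=extract_invoice):
    """Run the prototype on each invoice and return CSV text for manual review."""
    columns = ["invoice", "vendor_name", "address", "amount", "due_date", "reviewer_note"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for path in paths:
        row = {"invoice": path, "reviewer_note": ""}  # note is filled in by a human
        row.update(extract(path))
        writer.writerow(row)
    return buffer.getvalue()


sheet = build_review_sheet(["inv_001.pdf", "inv_002.pdf"])
print(sheet)
```

The empty `reviewer_note` column is where a person records what went wrong, which is the raw material for the error analysis that follows.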
Error Analysis Comes Before Evals
Error analysis is the process of manually reviewing outputs to identify common failure modes.
Error Analysis
graph TD
A[Outputs]
--> B[Manual Review]
--> C[Identify Failure Pattern]
--> D[Create Eval]
Error Analysis Workflow
A structured approach looks like this:
graph TD
A[Bad Final Output]
--> B[Inspect Trace]
B --> C[Identify Weak Component]
C --> D[Count Frequency]
D --> E[Prioritize Fixes]
E --> F[Improve System]
This turns debugging into a data-driven process.
Without error analysis, you risk measuring the wrong thing.
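The "Count Frequency" and "Prioritize Fixes" steps can be sketched in a few lines of Python. The review records and component names below are hypothetical tags a reviewer might assign while inspecting traces; they are not output from a real system.

```python
from collections import Counter

# Hypothetical review notes: one entry per bad final output, tagged by the
# reviewer with the component that caused the failure.
reviewed_failures = [
    {"invoice": "inv_001", "weak_component": "ocr"},
    {"invoice": "inv_004", "weak_component": "llm_extraction"},
    {"invoice": "inv_007", "weak_component": "llm_extraction"},
    {"invoice": "inv_009", "weak_component": "llm_extraction"},
    {"invoice": "inv_012", "weak_component": "ocr"},
    {"invoice": "inv_015", "weak_component": "database"},
]


def prioritize(failures):
    """Count failures per component and return them most frequent first."""
    counts = Counter(f["weak_component"] for f in failures)
    return counts.most_common()


priorities = prioritize(reviewed_failures)
for component, count in priorities:
    share = count / len(reviewed_failures)
    print(f"{component}: {count} failures ({share:.0%})")
```

With these made-up tags, the extraction step accounts for half of the failures, so it would be the first candidate for a fix and for a dedicated eval.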
The Four Types of Evals
A useful framework is a 2×2 matrix.
quadrantChart
title Evaluation Framework
x-axis No Ground Truth --> Ground Truth
y-axis Objective --> Subjective
quadrant-1 Subjective + Ground Truth
quadrant-2 Subjective + No Ground Truth
quadrant-3 Objective + No Ground Truth
quadrant-4 Objective + Ground Truth
1. Objective + Ground Truth
Failures measurable with deterministic rules and per-example labels.
Metric:
exact-match accuracy
Structured outputs make objective evals significantly easier.
Python implementation:
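def evaluate(actual, predicted):
    # Exact match between the labeled value and the agent's output.
    return actual == predicted

# Overall accuracy over a list of (actual, predicted) pairs:
def accuracy(pairs):
    return sum(evaluate(a, p) for a, p in pairs) / len(pairs)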
Examples:
- Invoice extraction
- Classification
- Named entity extraction
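For the invoice example, exact match is usually more useful per field than per invoice, because it shows which field is failing. The sketch below assumes each label and prediction is a dictionary; the field names are snake_case versions of the fields listed earlier, and the sample data is invented for illustration.

```python
FIELDS = ["vendor_name", "address", "amount", "due_date"]


def field_accuracy(labels, predictions, fields=FIELDS):
    """Return exact-match accuracy per field across all examples."""
    correct = {field: 0 for field in fields}
    for label, prediction in zip(labels, predictions):
        for field in fields:
            if prediction.get(field) == label[field]:
                correct[field] += 1
    return {field: correct[field] / len(labels) for field in fields}


# Made-up labels and predictions, for illustration only.
labels = [
    {"vendor_name": "Acme Corp", "address": "1 Main St",
     "amount": "120.00", "due_date": "2025-01-31"},
    {"vendor_name": "Globex", "address": "9 Elm Ave",
     "amount": "75.50", "due_date": "2025-02-15"},
]
predictions = [
    {"vendor_name": "Acme Corp", "address": "1 Main St",
     "amount": "120.00", "due_date": "2025-01-13"},  # wrong due date
    {"vendor_name": "Globex", "address": "9 Elm Ave",
     "amount": "75.50", "due_date": "2025-02-15"},
]

scores = field_accuracy(labels, predictions)
print(scores)
```

In this toy data only `due_date` scores below 1.0, which points the next round of error analysis at date extraction.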
2. Objective + No Ground Truth
Failures measurable with deterministic rules but no per-example labels.
Metric:
rule_compliance
Examples:
- Word count limits
- JSON validity
- Formatting rules
Consider a marketing copy agent with this requirement:
Instagram captions must be 10 words or fewer.
Example outputs:
Stylish sunglasses built for every summer adventure
Word count = 7
Pass.
Evaluation Logic
def word_count(text):
    return len(text.split())
Metric:
word_count(caption) <= 10
No custom label is needed.
Every example shares the same rule.
Accuracy Calculation
rule_compliance = outputs that pass the rule / total outputs
This is objective evaluation without per-example ground truth.
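A minimal sketch of the caption rule applied to a batch of outputs follows. The first caption is the one from the example above; the second is a hypothetical failing caption added for contrast, and `rule_compliance` is an illustrative helper name.

```python
def word_count(text):
    return len(text.split())


def passes_caption_rule(caption, max_words=10):
    """Same rule for every example: no per-example label required."""
    return word_count(caption) <= max_words


def rule_compliance(captions, max_words=10):
    """Fraction of captions that satisfy the word limit."""
    passed = sum(passes_caption_rule(c, max_words) for c in captions)
    return passed / len(captions)


captions = [
    "Stylish sunglasses built for every summer adventure",  # 7 words, passes
    "Meet the shades that go everywhere you go this summer and beyond",  # 12 words, fails
]

compliance = rule_compliance(captions)
print(f"Rule compliance: {compliance:.0%}")
```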
3. Subjective + Ground Truth
Failures that deterministic rules cannot easily measure, but that can be checked against per-example labels.
Metric:
LLM Judge + Gold Labels
Examples:
- Research quality
- Fact coverage
- Content completeness
Suppose the task is:
Write a report about black hole research.
You review outputs and discover:
Important discoveries are frequently omitted.
The challenge:
Different reports may discuss the same topic using different wording.
Simple regex matching becomes unreliable.
1. Gold Standard Talking Points
For each topic:
{
  "topic": "Black Holes",
  "important_points": [
    "Event Horizon",
    "Hawking Radiation",
    "Gravitational Waves",
    "Black Hole Imaging",
    "Accretion Disk Physics"
  ]
}
Now we have domain-specific ground truth.
2. LLM-as-a-Judge
Instead of writing rules:
if "Event Horizon" in report:
we ask another LLM:
Determine how many gold standard points
are present in this report.
Architecture:
graph TD
A[Research Agent]
--> B[Generated Report]
B --> C[Judge LLM]
C --> D[Coverage Score]
This handles paraphrasing much better than keyword matching.
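The judge step can be sketched as a function that asks a model about each gold point and returns the fraction covered. This is a sketch under stated assumptions: `ask_llm` is any callable that takes a prompt string and returns the model's text reply, the prompt wording is illustrative, and `fake_llm` is an offline stand-in that uses plain substring matching so the example runs without a model. A real judge would replace it with an actual model call.

```python
GOLD = {
    "topic": "Black Holes",
    "important_points": [
        "Event Horizon",
        "Hawking Radiation",
        "Gravitational Waves",
        "Black Hole Imaging",
        "Accretion Disk Physics",
    ],
}

# Illustrative prompt wording (an assumption, not a recommended template).
JUDGE_PROMPT = (
    "Does the report below cover the talking point '{point}'? "
    "Answer YES or NO.\n\nReport:\n{report}"
)


def coverage_score(report, gold_points, ask_llm):
    """Fraction of gold points that the judge says the report covers."""
    covered = 0
    for point in gold_points:
        answer = ask_llm(JUDGE_PROMPT.format(point=point, report=report))
        if answer.strip().upper().startswith("YES"):
            covered += 1
    return covered / len(gold_points)


def fake_llm(prompt):
    """Offline stand-in for a model call: substring match only, no paraphrase handling."""
    point = prompt.split("'")[1]
    report = prompt.split("Report:\n", 1)[1]
    return "YES" if point.lower() in report.lower() else "NO"


report = (
    "The event horizon marks the point of no return. "
    "Hawking radiation suggests that black holes slowly evaporate."
)
score = coverage_score(report, GOLD["important_points"], fake_llm)
print(f"Coverage: {score:.0%}")
```

Asking one question per gold point, rather than asking for a single total, makes each judgment easy to spot-check by hand.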
4. Subjective + No Ground Truth
Failures not easily measurable with deterministic rules and with no per-example labels.
Examples:
- Chart aesthetics
- Writing style
- Clarity
- Creativity
Metric:
LLM Judge + Rubric
This is often the hardest category.
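A rubric-based judge can be sketched as one scored question per criterion. The criteria come from the examples above, but the rubric questions, the 1 to 5 scale, and the prompt wording are assumptions made for illustration. `ask_llm` is any callable that maps a prompt to a text reply, and `fake_llm` is a fixed stand-in so the sketch runs offline.

```python
# Hypothetical rubric: criterion -> question the judge is asked.
RUBRIC = {
    "clarity": "Is the text easy to follow on a first read?",
    "style": "Does the tone fit the intended audience?",
    "creativity": "Does the text avoid generic phrasing?",
}


def build_rubric_prompt(output, criterion, question):
    return (
        f"Rate the text on {criterion} from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\n"
        "Reply with a single integer.\n\n"
        f"Text:\n{output}"
    )


def rubric_scores(output, ask_llm, rubric=RUBRIC):
    """Ask the judge once per criterion and clamp each score to the 1-5 range."""
    scores = {}
    for criterion, question in rubric.items():
        reply = ask_llm(build_rubric_prompt(output, criterion, question))
        scores[criterion] = min(5, max(1, int(reply.strip())))
    return scores


def fake_llm(prompt):
    """Offline stand-in for a model call; always answers 4."""
    return "4"


scores = rubric_scores("Stylish sunglasses built for every summer adventure", fake_llm)
overall = sum(scores.values()) / len(scores)
print(scores, overall)
```

Scoring each criterion separately keeps the judge's output comparable across runs, even though no ground truth exists for any single example.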
Human-Level Performance as a Benchmark
A useful question:
Where is the agent still worse than an expert human?
Those gaps often reveal the highest-value improvements.
For example:
| Task | Human | Agent |
|---|---|---|
| Date Extraction | 99% | 90% |
| Research Coverage | 95% | 72% |
| Policy Compliance | 98% | 91% |
These become your roadmap.
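Turning the table into a roadmap is simple arithmetic: subtract the agent score from the human score and sort by the gap. The sketch below uses the percentages from the table above; `rank_by_gap` is a hypothetical helper name.

```python
# (human %, agent %) per task, taken from the table above.
benchmarks = {
    "Date Extraction": (99, 90),
    "Research Coverage": (95, 72),
    "Policy Compliance": (98, 91),
}


def rank_by_gap(results):
    """Return (task, gap in percentage points), largest gap first."""
    gaps = {task: human - agent for task, (human, agent) in results.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


roadmap = rank_by_gap(benchmarks)
for task, gap in roadmap:
    print(f"{task}: {gap} points behind human performance")
```

Here Research Coverage has the largest gap at 23 points, so it would be the first place to look.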
Final Thoughts
The biggest misconception in Agentic AI is that success comes from better prompts.
In practice, it comes from the loop described earlier: build, observe, analyze, create evals, improve, and measure.
The best agentic systems are not built through perfect planning.
They are built through disciplined experimentation, rigorous evaluation, and continuous improvement.
And that is why evaluation is one of the most important skills in modern AI engineering.
