Error Analysis for Agentic AI
Learn how to systematically diagnose, measure, and improve failures in Agentic AI systems using error analysis. Discover how traces, component-level evaluations, root cause analysis, and observability help identify bottlenecks and drive continuous improvement in AI agent performance.
How Top Teams Decide What to Fix Next
One of the biggest challenges in building agentic systems is not creating the first version.
It is improving it.
Almost every AI engineer has experienced this:
graph TD
A[Build Workflow] --> B[Test Workflow]
B --> C[Disappointing Results]
C --> D[Now What?]
The problem is that agentic workflows contain many moving parts.
Without a systematic process, teams often spend weeks optimizing the wrong component.
This is where Error Analysis becomes one of the most valuable skills in Agentic AI engineering.
Why Guessing Fails
Many teams optimize based on intuition:
"This feels like a prompt issue."
"This feels like a retrieval issue."
"This feels like the model is weak."
Sometimes they are right.
Often they are not.
The danger is spending months improving a component that contributes very little to overall performance.
The Engineering Mindset
Strong AI teams think like performance engineers.
Instead of asking:
What can we improve?
They ask:
What should we improve?
Those are very different questions.
The first creates endless work.
The second creates measurable progress.
Traces: The X-Ray of Agentic Systems
To diagnose failures, we need visibility into intermediate outputs.
Together, these intermediate outputs form a:
Trace
A trace contains every step executed by an agent.
One of the most valuable debugging techniques is trace inspection.
A trace records:
- prompts
- intermediate reasoning
- tool calls
- retrieval outputs
- state transitions
- memory updates
Example:
{
"query": "Recent developments in black hole science",
"search_terms": [
"event horizon telescope discoveries",
"black hole imaging research"
],
"search_results": [
"https://astrokidnews.com/...",
"https://spacefunblog.com/..."
],
"selected_sources": [
"...",
"..."
],
"summary": "..."
}
Instead of examining only the final answer, we inspect every intermediate step.
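A minimal sketch of walking such a trace step by step, assuming the trace is stored as a JSON object with the same keys as the example above. The `iter_steps` helper and the `STEP_ORDER` list are hypothetical names introduced for illustration, not part of any library.

```python
import json

# A trace shaped like the example above; elided values are kept as placeholders.
TRACE_JSON = """
{
  "query": "Recent developments in black hole science",
  "search_terms": ["event horizon telescope discoveries", "black hole imaging research"],
  "search_results": ["https://astrokidnews.com/...", "https://spacefunblog.com/..."],
  "selected_sources": ["...", "..."],
  "summary": "..."
}
"""

# Assumed order in which the workflow produced each intermediate output.
STEP_ORDER = ["query", "search_terms", "search_results", "selected_sources", "summary"]


def iter_steps(trace: dict):
    """Yield (step_name, output) pairs in execution order, skipping missing steps."""
    for name in STEP_ORDER:
        if name in trace:
            yield name, trace[name]


trace = json.loads(TRACE_JSON)
steps = list(iter_steps(trace))

# Print each intermediate output so a reviewer can judge it on its own.
for name, output in steps:
    print(f"--- {name} ---")
    print(output)
```

Printing one step at a time makes it easier to ask, for each step, whether a human expert would have produced the same output.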
Span vs Trace
Two terms frequently appear in observability systems.
Span
A single step.
Example:
- Generate Search Terms
- Fetch Web Results
Trace
The complete execution path.
Example (search workflow):
graph LR
A[Search Terms] --> B[Search Results] --> C[Selected Sources] --> D[Summary]
A trace is simply a collection of spans.
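The relationship between spans and traces can be sketched with two small data classes. This is an illustrative structure, not the schema of any particular observability tool; `Span`, `Trace`, `run_step`, and the two stand-in step functions are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Span:
    """One step of the workflow: its name, input, output, and duration."""
    name: str
    input_value: Any
    output_value: Any
    duration_s: float


@dataclass
class Trace:
    """The complete execution path: an ordered collection of spans."""
    spans: list[Span] = field(default_factory=list)

    def run_step(self, name: str, fn: Callable[[Any], Any], value: Any) -> Any:
        """Run one step, record it as a span, and return its output."""
        start = time.perf_counter()
        output = fn(value)
        elapsed = time.perf_counter() - start
        self.spans.append(Span(name, value, output, elapsed))
        return output


# Stand-in steps (hypothetical); a real agent would call a model or a search API.
def generate_search_terms(query: str) -> list[str]:
    return [f"{query} research", f"{query} discoveries"]


def fetch_web_results(terms: list[str]) -> list[str]:
    return [f"https://example.com/search?q={t.replace(' ', '+')}" for t in terms]


trace = Trace()
terms = trace.run_step("Generate Search Terms", generate_search_terms, "black hole")
results = trace.run_step("Fetch Web Results", fetch_web_results, terms)
print([span.name for span in trace.spans])
```

Because each span stores both its input and its output, a reviewer can evaluate one step without rerunning the whole workflow.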
The Error Analysis Flywheel
The best teams repeatedly execute:
graph TD
A[Build] --> B[Observe]
B --> C[Trace Analysis]
C --> D[Identify Bottleneck]
D --> E[Improve Component]
E --> F[Measure Again]
F --> A
Each iteration makes the system incrementally better.
Error Analysis Workflow
A structured approach looks like this:
graph TD
A[Bad Final Output] --> B[Inspect Trace]
B --> C[Identify Weak Component]
C --> D[Count Frequency]
D --> E[Prioritize Fixes]
E --> F[Improve System]
This turns debugging into a data-driven process.
Practical Example
Suppose we ask a research agent:
Write a report on recent developments in black hole science.
The generated report misses several important discoveries.
At first glance:
Output Quality = Poor
But that tells us nothing about why.
The root cause could exist anywhere in the workflow.
graph TD
A[User Query] --> B[Generate Search Terms]
B --> C[Web Search]
C --> D[Select Best Sources]
D --> E[Fetch Documents]
E --> F[Summarize]
F --> G[Generate Report]
Suppose our agent generated:
1. Search Terms
Black hole theories Einstein
Event Horizon Telescope radio
Question:
Would a human expert use these search terms?
If yes:
Search Term Generation = Good
Move on.
2. Search Results
Returned URLs:
AstroKidNews
SpaceFunBlog
SpaceBot2000
Question:
Would a human researcher use these sources?
Probably not.
A human would likely prefer:
- Nature
- Science
- arXiv
- NASA
- ESA
Now we have a clue.
Search Results = Weak
The problem may not be the search terms.
The problem may be the search engine or ranking strategy.
Focus Only on Failures
One subtle but important recommendation:
Do not waste time analyzing successful runs.
Suppose:
| Run | Result |
|---|---|
| 1 | Good |
| 2 | Good |
| 3 | Poor |
| 4 | Good |
| 5 | Poor |
Focus on:
Run 3
Run 5
These contain the information you need.
This is why it is called:
Error Analysis
The goal is understanding failure modes.
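Selecting the failed runs from the table above can be sketched in a few lines. The `runs` list and its field names are assumptions made for illustration; in practice the labels would come from whatever evaluation produced the Good/Poor verdicts.

```python
# Run results copied from the table above.
runs = [
    {"run": 1, "result": "Good"},
    {"run": 2, "result": "Good"},
    {"run": 3, "result": "Poor"},
    {"run": 4, "result": "Good"},
    {"run": 5, "result": "Poor"},
]

# Keep only the runs whose final output was judged poor.
failed_runs = [r["run"] for r in runs if r["result"] == "Poor"]
print(failed_runs)  # [3, 5]
```

Only the traces for these runs need to be opened and inspected step by step.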
Build an Error Analysis Spreadsheet
One of the simplest and most effective tools is a spreadsheet, such as Excel or Google Sheets.
Example:
| Query | Search Terms | Search Results | Source Selection | Final Output |
|---|---|---|---|---|
| Black Holes | Good | Bad | Good | Bad |
| Seattle Housing | Good | Bad | Good | Bad |
| Fruit Harvesting Robots | Bad | Bad | Bad | Bad |
Now count failures.
Example Statistics
| Component | Error Rate |
|---|---|
| Search Terms | 5% |
| Search Results | 45% |
| Source Selection | 10% |
| Summarization | 8% |
This immediately tells us:
Search Results are the largest source of failure.
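Counting failures per component can be automated once the spreadsheet is exported. The sketch below assumes a CSV export with the same columns as the example spreadsheet and the labels "Good" and "Bad"; the `error_rates` function is a hypothetical helper. It uses only the three example rows, so its numbers differ from the example statistics, which assume a larger sample.

```python
import csv
import io

# The three annotated rows from the example spreadsheet, as a CSV export.
SHEET = """Query,Search Terms,Search Results,Source Selection,Final Output
Black Holes,Good,Bad,Good,Bad
Seattle Housing,Good,Bad,Good,Bad
Fruit Harvesting Robots,Bad,Bad,Bad,Bad
"""

COMPONENTS = ["Search Terms", "Search Results", "Source Selection"]


def error_rates(rows: list[dict], components: list[str]) -> dict[str, float]:
    """Fraction of analyzed traces in which each component was labeled 'Bad'."""
    total = len(rows)
    if total == 0:
        return {name: 0.0 for name in components}
    return {
        name: sum(row[name] == "Bad" for row in rows) / total
        for name in components
    }


rows = list(csv.DictReader(io.StringIO(SHEET)))
rates = error_rates(rows, COMPONENTS)

# Show the worst component first.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.0%}")
```

Even on this tiny sample, Search Results is labeled Bad in every row, which matches the conclusion drawn from the example statistics.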
Quantifying Error Rates
If:
- 100 traces analyzed
- 45 traces contain poor search results
Then:
Search Results Error Rate = 45 / 100 = 45%
This provides objective evidence.
Instead of:
"I think search is the issue."
you can say:
"Search contributes to 45% of failures."
Prioritization Matrix
Not every problem deserves immediate attention.
A useful framework is:
| Component | Error Rate | Easy to Fix? |
|---|---|---|
| Search Terms | 5% | Yes |
| Search Results | 45% | Yes |
| Summarization | 8% | No |
Prioritize:
High Error Rate
+
Easy Improvement
This often delivers the largest gains.
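One possible way to turn the matrix into an ordered to-do list is sketched below. The ordering rule is an assumption, not a rule stated in the text: easy fixes are ranked ahead of hard ones, and within each group the higher error rate comes first. The `Candidate` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    component: str
    error_rate: float  # fraction of analyzed traces, e.g. 0.45 for 45%
    easy_to_fix: bool


# Values copied from the prioritization matrix above.
candidates = [
    Candidate("Search Terms", 0.05, True),
    Candidate("Search Results", 0.45, True),
    Candidate("Summarization", 0.08, False),
]

# Assumed rule: easy fixes first, then higher error rate first.
ranked = sorted(candidates, key=lambda c: (not c.easy_to_fix, -c.error_rate))

for position, c in enumerate(ranked, start=1):
    effort = "easy" if c.easy_to_fix else "hard"
    print(f"{position}. {c.component} ({c.error_rate:.0%}, {effort})")
```

A team that weighs error rate more heavily than effort could swap the two sort keys; the point is to make the rule explicit so it can be debated.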
Example Improvements
After identifying weak search results, possible fixes include:
1. Better Search Provider
Replace:
Search Engine A
with:
Search Engine B
2. Improved Ranking
Add:
rerank_results()
before source selection.
3. Domain Filtering
Restrict searches to:
nature.com
science.org
arxiv.org
nasa.gov
These targeted improvements are only possible because error analysis revealed the bottleneck.
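The reranking and domain filtering fixes can be sketched together. This is one simple interpretation under stated assumptions: the text names `rerank_results()` without giving a signature, so the signature and the rule used here (preferred domains first, original order otherwise) are hypothetical, as are the `domain_of`, `is_preferred`, and `filter_results` helpers.

```python
from urllib.parse import urlparse

# Domains listed in the text as preferred sources.
PREFERRED_DOMAINS = ("nature.com", "science.org", "arxiv.org", "nasa.gov")


def domain_of(url: str) -> str:
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_preferred(url: str) -> bool:
    """True if the host is a preferred domain or one of its subdomains."""
    host = domain_of(url)
    return any(host == d or host.endswith("." + d) for d in PREFERRED_DOMAINS)


def filter_results(urls: list[str]) -> list[str]:
    """Domain filtering: drop every result outside the preferred domains."""
    return [u for u in urls if is_preferred(u)]


def rerank_results(urls: list[str]) -> list[str]:
    """Reranking: move preferred domains to the front, keeping relative order."""
    return sorted(urls, key=lambda u: not is_preferred(u))


# Placeholder URLs; the paths are intentionally elided.
results = [
    "https://astrokidnews.com/...",
    "https://arxiv.org/...",
    "https://spacefunblog.com/...",
    "https://www.nasa.gov/...",
]

print(rerank_results(results))
print(filter_results(results))
```

Reranking keeps every result and only changes the order, while filtering discards anything outside the list, so filtering is the stricter of the two and may return nothing for topics those domains do not cover.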
Error Analysis vs Prompt Engineering
Many beginners immediately modify prompts.
But consider:
Bad Search Results
Will a better summarization prompt help?
Probably not.
The summarizer can only work with the information it receives.
Error analysis prevents optimization in the wrong place.
Final Thoughts
Agentic systems contain many components:
- planners
- retrievers
- search engines
- evaluators
- memory systems
- tool callers
- generators
When performance is poor, almost any component could be responsible.
Error analysis provides a systematic way to identify:
- Which component is failing
- How often it fails
- Whether it is worth fixing
Without error analysis, teams guess which component to fix.
With error analysis, they measure which component fails most often and fix that first.
That distinction often determines whether an AI team improves a system in days or spends months optimizing the wrong thing.
