Code Execution in Agentic AI
Learn how Agentic AI systems generate, execute, and refine code to solve complex problems, perform calculations, automate workflows, and interact with external systems. Explore execution loops, self-correction, sandboxing, and the role of code execution in building powerful autonomous AI agents.
When LLMs Become Programmers
One of the most powerful capabilities in modern Agentic AI systems is not planning, retrieval, or tool use.
It is code execution.
When developers first learn about tool calling, they often create individual tools for every possible operation:
add()
subtract()
multiply()
divide()
This works initially.
But very quickly the approach breaks down.
What happens when users ask more advanced questions?
What is the square root of 2?
Now we need:
sqrt()
power()
log()
sin()
cos()
tan()
Eventually, we end up recreating an entire scientific calculator as individual tools.
There is a better approach.
Instead of creating thousands of tools:
Let the LLM write code.
Why Code Execution Matters
Traditional tool use looks like:
graph TD
A[User Question] --> B[LLM]
B --> C[Specific Tool]
C --> D[Result]
D --> B
B --> E[Answer]
Each new capability requires another tool.
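To make that limitation concrete, here is a minimal sketch of a fixed tool registry. The names `TOOLS` and `call_tool` are hypothetical and chosen only for illustration; real tool-calling frameworks differ in the details.

```python
import operator

# Hypothetical registry: every capability has to be registered by hand.
TOOLS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}

def call_tool(name, *args):
    """Dispatch a tool call requested by the model."""
    if name not in TOOLS:
        raise KeyError(f"no tool named {name!r}")
    return TOOLS[name](*args)

print(call_tool("add", 2, 3))  # 5
```

Asking this registry for `sqrt` raises `KeyError` until someone writes and registers that tool too.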
Code execution changes the architecture.
graph TD
A[User Question] --> B[LLM]
B --> C[Generate Python]
C --> D[Execute Code]
D --> E[Execution Result]
E --> B
B --> F[Answer]
The LLM effectively creates its own tools on demand.
Example: Solving Mathematical Problems
Suppose a user asks:
What is the square root of 2?
Instead of calling a predefined tool, the model generates:
import math
print(math.sqrt(2))
The system executes the code and obtains:
1.4142135623730951
The result is returned to the model.
The model then responds:
The square root of 2 is approximately 1.4142.
This enables virtually unlimited computational capabilities without requiring a custom function for every operation.
Building a Code Execution Agent
The basic workflow consists of four stages.
sequenceDiagram
participant User
participant LLM
participant Executor
User->>LLM: Solve math problem
LLM->>Executor: Generated Python
Executor-->>LLM: Result
LLM-->>User: Final Answer
Prompting the Model
A common pattern is instructing the model to emit executable code.
SYSTEM_PROMPT = """
Write Python code to solve the problem.
Wrap all code inside:
<execute_python>
...
</execute_python>
"""
User:
What is the square root of 2?
Model:
<execute_python>
import math
print(math.sqrt(2))
</execute_python>
Extracting the Generated Code
The application extracts the code block.
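import re

# llm_output holds the raw text returned by the model.
pattern = r"<execute_python>(.*?)</execute_python>"
# re.DOTALL lets "." match newlines, so multi-line code is captured.
match = re.search(pattern, llm_output, re.DOTALL)
if match is None:
    raise ValueError("no <execute_python> block in the model output")
code = match.group(1)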
The extracted code can then be executed.
Executing Python
The simplest approach uses Python's built-in execution engine.
exec(code)
This runs whatever code the model generates, inside the host process.
However:
It also introduces risk.
The generated code may not always be safe.
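One practical detail: `exec` runs the code but does not return what it printed. Here is a minimal sketch of capturing that output so it can be sent back to the model, using the standard library's `contextlib.redirect_stdout`. The helper name `run_and_capture` is hypothetical, and nothing here makes execution safe.

```python
import contextlib
import io

def run_and_capture(code):
    """Execute code with exec and return everything it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # fresh globals; still NOT safe for untrusted code
    return buffer.getvalue()

output = run_and_capture("import math\nprint(math.sqrt(2))")
print(output)  # 1.4142135623730951
```

Passing an empty dictionary as globals keeps the generated code from reading or overwriting the application's own variables, but it can still import modules and touch the filesystem.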
Beyond Simple Math
Code execution becomes far more interesting for complex calculations.
Consider compound interest.
User:
If I invest $500 monthly for 10 years
at 8% annual return,
what will I have?
Generated code:
r = 0.08 / 12
n = 120
payment = 500
future_value = (
payment *
(((1 + r) ** n - 1) / r)
)
print(round(future_value, 2))
The model can solve problems that would otherwise require dozens of specialized financial tools.
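Generated formulas are worth a sanity check. The sketch below compares the closed-form expression above with a month-by-month simulation, assuming deposits are made at the end of each month and interest compounds monthly (the convention the formula implies). The function names are hypothetical.

```python
def future_value_closed_form(payment, annual_rate, years):
    """Future value of a series of end-of-month deposits, closed form."""
    r = annual_rate / 12
    n = years * 12
    return payment * (((1 + r) ** n - 1) / r)

def future_value_simulated(payment, annual_rate, years):
    """Same quantity, computed one month at a time."""
    r = annual_rate / 12
    balance = 0.0
    for _ in range(years * 12):
        balance = balance * (1 + r) + payment  # add interest, then deposit
    return balance

closed = future_value_closed_form(500, 0.08, 10)
simulated = future_value_simulated(500, 0.08, 10)
print(round(closed, 2), round(simulated, 2))
```

If the two numbers agree to the cent, the formula and the loop are describing the same investment.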
Reflection Improves Reliability
What happens if the generated code fails?
Example:
print(math.squrt(2))
Output:
AttributeError: module 'math' has no attribute 'squrt'
Rather than stopping, we can implement a reflection loop.
graph TD
A[Generate Code] --> B[Execute]
B --> C{Success?}
C -->|No| D[Return Error]
D --> E[Regenerate Code]
E --> B
C -->|Yes| F[Final Answer]
The error message becomes feedback.
This is one of the most common patterns in coding agents today.
Implementing Self-Correction
try:
    exec(code)
except Exception as e:
    # Keep the exception type as well as the message.
    error = f"{type(e).__name__}: {e}"
    prompt = f"""
Fix this Python code.
Error:
{error}
Code:
{code}
"""
The model can then produce a corrected version automatically.
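Put together, the reflection loop might look like the sketch below. `fake_llm` is a hypothetical stand-in for a real model call: it returns the typo first and the fix once it sees an error in the prompt. The `max_attempts` cap is an assumption added here so the loop cannot run forever.

```python
import contextlib
import io

def run_and_capture(code):
    """Execute code and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # NOT safe for untrusted code; use a sandbox in practice
    return buffer.getvalue()

def fake_llm(prompt):
    """Hypothetical model stub: fixes the code once it is shown an error."""
    if "Error:" in prompt:
        return "import math\nprint(math.sqrt(2))"
    return "import math\nprint(math.squrt(2))"  # deliberate typo

def solve_with_reflection(question, llm=fake_llm, max_attempts=3):
    prompt = question
    for attempt in range(1, max_attempts + 1):
        code = llm(prompt)
        try:
            return run_and_capture(code), attempt
        except Exception as e:
            # Feed the error type, message, and failing code back to the model.
            prompt = (
                "Fix this Python code.\n"
                f"Error:\n{type(e).__name__}: {e}\n"
                f"Code:\n{code}"
            )
    raise RuntimeError(f"no working code after {max_attempts} attempts")

output, attempts = solve_with_reflection("What is the square root of 2?")
print(output.strip(), attempts)
```

With this stub the first attempt fails with `AttributeError` and the second succeeds.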
Why Coding Agents Are So Effective
Traditional tool systems are constrained by available tools.
Code execution creates a universal tool.
Traditional systems increase capabilities only by adding tools: each new capability costs one more tool.
Code execution changes the equation.
Since Python can:
- manipulate files
- process images
- analyze data
- call APIs
- perform calculations
the capability space expands dramatically.
The Security Problem
There is an obvious concern.
What if the generated code is harmful?
For example:
import os
os.remove("important_file.py")
Or worse:
import shutil
shutil.rmtree("/")
The model does not inherently understand the consequences of execution.
This creates one of the largest risks in Agentic AI systems.
Real-World Example
Coding agents occasionally generate dangerous commands.
Examples include:
- deleting project files
- modifying repositories
- overwriting configurations
- removing source code
Even highly capable models can make mistakes.
The lesson highlights a real-world incident where an agent deleted Python files inside a project before realizing the error. Fortunately, the repository was backed up in version control.
Sandboxing: The Correct Approach
Best practice is never to execute arbitrary code directly on production systems.
Instead:
graph TD
A[Generated Code] --> B[Sandbox]
B --> C[Execute]
C --> D[Result]
D --> E[LLM]
The sandbox isolates execution from:
- sensitive files
- production databases
- customer data
- infrastructure systems
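A small first step in that direction is to stop running generated code inside the application's own process. The sketch below uses the standard library's `subprocess.run` with a timeout; the helper name `run_in_subprocess` is hypothetical. A separate process is not a sandbox, since it still has the same file and network access as the user who launched it, but it does contain crashes and infinite loops.

```python
import subprocess
import sys

def run_in_subprocess(code, timeout_seconds=5):
    """Run code in a separate Python process with a time limit."""
    try:
        completed = subprocess.run(
            # -I: isolated mode, ignores user site-packages and PYTHON* env vars
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }

result = run_in_subprocess("print(2 ** 10)")
print(result["ok"], result["stdout"].strip())
```

The returned `stderr` carries the traceback when the code fails, which is exactly the feedback a reflection loop needs.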
Popular Sandboxing Options
1. Docker
docker run python:3.12
Provides:
- filesystem isolation
- process isolation
- resource limits
2. E2B
Provides lightweight cloud sandboxes specifically designed for AI agents.
Useful for:
- coding assistants
- data analysis agents
- autonomous workflows
The lesson specifically recommends sandboxed environments such as Docker or E2B for safer execution.
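For the Docker route, the command line is where most of the isolation gets configured. The sketch below only builds the command; the helper name `docker_command` and the specific limits are assumptions for illustration, not settings recommended by the lesson.

```python
import shlex

def docker_command(code, image="python:3.12"):
    """Build (but do not run) a docker command for a throwaway container."""
    return [
        "docker", "run",
        "--rm",               # delete the container afterwards
        "--network", "none",  # no network access
        "--memory", "256m",   # cap memory
        "--cpus", "1",        # cap CPU
        "--read-only",        # read-only root filesystem
        image,
        "python", "-c", code,
    ]

cmd = docker_command("print(2 ** 0.5)")
print(shlex.join(cmd))
```

On a machine with Docker installed, the list can be passed to `subprocess.run` with a timeout to execute the code and collect its output.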
Tool Use vs Code Execution
Tool use:
Known Functions
Code execution:
Generate New Functions Dynamically
Comparison:
| Capability | Tool Use | Code Execution |
|---|---|---|
| Fixed actions | ✓ | ✓ |
| Unlimited calculations | ✗ | ✓ |
| Dynamic algorithms | ✗ | ✓ |
| Data analysis | Limited | Excellent |
| Security risk | Medium | High |
| Flexibility | Medium | Very High |
The Bigger Shift
The progression of agent capabilities often looks like this:
graph TD
A[Prompting] --> B[Tool Use] --> C[Code Execution] --> D[Autonomous Agents]
Tool use gave LLMs access to external systems.
Code execution gives them access to programmable computation.
And that combination is what powers many of today's most capable AI agents.
Why Frontier Models Are Trained for Code Execution
Many modern models receive specialized training for:
- code generation
- debugging
- tool usage
- reflection loops
- execution feedback
This is because code execution dramatically improves reasoning performance.
The model no longer has to compute everything mentally.
Instead of predicting the answer directly, it writes a short program and reads back the interpreter's output, which is often significantly more reliable.
Final Thoughts
Code execution is one of the most important capabilities in Agentic AI.
Instead of building hundreds of specialized tools, we allow the model to create its own solution dynamically.
The result is a system capable of:
- advanced mathematics
- financial modeling
- data processing
- algorithmic reasoning
- workflow automation
However, with that power comes responsibility.
The best production systems combine code execution, reflection loops, and sandboxing to create agents that are both powerful and safe.
As agentic systems continue to evolve, code execution will likely remain one of the foundational capabilities separating simple chatbots from truly autonomous AI systems.
