
Code Execution in Agentic AI

Learn how Agentic AI systems generate, execute, and refine code to solve complex problems, perform calculations, automate workflows, and interact with external systems. Explore execution loops, self-correction, sandboxing, and the role of code execution in building powerful autonomous AI agents.

Hitesh Sahu
Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026


When LLMs Become Programmers

One of the most powerful capabilities in modern Agentic AI systems is not planning, retrieval, or tool use.

It is code execution.

When developers first learn about tool calling, they often create individual tools for every possible operation:

add()
subtract()
multiply()
divide()

This works initially.

But very quickly the approach breaks down.

What happens when users ask more advanced questions?

What is the square root of 2?

Now we need:

sqrt()
power()
log()
sin()
cos()
tan()

Eventually, we end up recreating an entire scientific calculator as individual tools.

There is a better approach.

Instead of creating thousands of tools:

Let the LLM write code.

Why Code Execution Matters

Traditional tool use looks like:

graph TD
    A[User Question]
    --> B[LLM]

    B --> C[Specific Tool]

    C --> D[Result]

    D --> B

    B --> E[Answer]

Each new capability requires another tool.
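To make the scaling problem concrete, here is a minimal sketch of a fixed tool registry. The `TOOLS` dictionary and the `call_tool` helper are hypothetical names used for illustration, not part of any particular framework.

```python
import operator

# Hypothetical registry: one entry per operation the agent can perform.
TOOLS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}

def call_tool(name, *args):
    """Dispatch a tool call by name; unknown tools are reported, not guessed."""
    if name not in TOOLS:
        return f"Unknown tool: {name}"
    return TOOLS[name](*args)

print(call_tool("add", 2, 3))   # 5
print(call_tool("sqrt", 2))     # Unknown tool: sqrt
```

Anything outside the registry fails until someone writes and registers another function.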

Code execution changes the architecture.

graph TD
    A[User Question]
    --> B[LLM]

    B --> C[Generate Python]

    C --> D[Execute Code]

    D --> E[Execution Result]

    E --> B

    B --> F[Answer]

The LLM effectively creates its own tools on demand.

Example: Solving Mathematical Problems

Suppose a user asks:

What is the square root of 2?

Instead of calling a predefined tool, the model generates:

import math

print(math.sqrt(2))

The system executes the code and obtains:

1.4142135623730951

The result is returned to the model.

The model then responds:

The square root of 2 is approximately 1.4142.

This enables virtually unlimited computational capabilities without requiring a custom function for every operation.

Building a Code Execution Agent

The basic workflow consists of four stages.

sequenceDiagram
    participant User
    participant LLM
    participant Executor

    User->>LLM: Solve math problem

    LLM->>Executor: Generated Python

    Executor-->>LLM: Result

    LLM-->>User: Final Answer

Prompting the Model

A common pattern is instructing the model to emit executable code.

SYSTEM_PROMPT = """
    Write Python code to solve the problem.
    
    Wrap all code inside:
    
    <execute_python>
    ...
    </execute_python>
"""

User:

What is the square root of 2?

Model:

<execute_python>
    import math
    
    print(math.sqrt(2))
</execute_python>

Extracting the Generated Code

The application extracts the code block.

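import re

# Non-greedy match so only the text between the tags is captured;
# re.DOTALL lets "." span the newlines inside the code.
pattern = r"<execute_python>(.*?)</execute_python>"

match = re.search(pattern, llm_output, re.DOTALL)

# The model may answer without emitting any code, so guard against None.
code = match.group(1) if match else None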

The extracted code can then be executed.
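A slightly more defensive version of this step is sketched below. It assumes the tag format from the system prompt, and `extract_code` is a hypothetical helper name. Two details matter in practice: the model may reply without any tags, and code that is indented inside the tags (as in the example reply) raises an `IndentationError` in `exec` unless the common indentation is removed first.

```python
import re
import textwrap

CODE_PATTERN = re.compile(r"<execute_python>(.*?)</execute_python>", re.DOTALL)

def extract_code(llm_output):
    """Return the dedented code inside the tags, or None if no tags are present."""
    match = CODE_PATTERN.search(llm_output)
    if match is None:
        return None
    # Strip the common leading indentation so exec() accepts the code.
    return textwrap.dedent(match.group(1)).strip()

reply = """<execute_python>
    import math

    print(math.sqrt(2))
</execute_python>"""

print(extract_code(reply))
print(extract_code("No code here."))  # None
```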

Executing Python

The simplest approach uses Python's built-in execution engine.

exec(code)

This lets the application run any code the model generates.

However:

It also introduces risk.

The generated code may not always be safe.
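A related practical detail is that `exec` returns `None`, so anything the code prints has to be captured before it can be passed back to the model. The sketch below does this with the standard library. `run_and_capture` is a hypothetical helper name, and it still runs the code with no isolation at all.

```python
import contextlib
import io

def run_and_capture(code):
    """Execute code and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # A fresh globals dict keeps the generated code out of our own namespace.
        # This is NOT a security boundary: the code can still import anything.
        exec(code, {})
    return buffer.getvalue()

output = run_and_capture("import math\nprint(math.sqrt(2))")
print(output.strip())  # 1.4142135623730951
```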

Beyond Simple Math

Code execution becomes far more interesting for complex calculations.

Consider compound interest.

User:

If I invest $500 monthly for 10 years
at 8% annual return,
what will I have?

Generated code:

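r = 0.08 / 12      # monthly interest rate
n = 120            # number of monthly payments (10 years)
payment = 500      # amount invested each month

# Future value of a series of equal end-of-month payments
future_value = (
    payment *
    (((1 + r) ** n - 1) / r)
)

print(round(future_value, 2))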

The model can solve problems that would otherwise require dozens of specialized financial tools.
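The closed-form expression in the generated code is the future value of an ordinary annuity, which assumes each payment is made at the end of the month and interest compounds monthly. As a sanity check under those assumptions, a month-by-month simulation should agree with the formula:

```python
import math

r = 0.08 / 12      # monthly rate
n = 120            # number of monthly payments (10 years)
payment = 500

# Closed form: future value of an ordinary annuity.
closed_form = payment * (((1 + r) ** n - 1) / r)

# Simulation: each month the balance earns interest, then a payment is added.
balance = 0.0
for _ in range(n):
    balance = balance * (1 + r) + payment

print(round(closed_form, 2), round(balance, 2))
assert math.isclose(closed_form, balance, rel_tol=1e-9)
```

Being able to run this kind of cross-check is itself an advantage of code execution: the agent can verify a formula rather than trust it.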

Reflection Improves Reliability

What happens if the generated code fails?

Example:

print(math.squrt(2))

Output:

AttributeError

Rather than stopping, we can implement a reflection loop.

graph TD
    A[Generate Code]
    --> B[Execute]

    B --> C{Success?}

    C -->|No| D[Return Error]

    D --> E[Regenerate Code]

    E --> B

    C -->|Yes| F[Final Answer]

The error message becomes feedback.

This is one of the most common patterns in coding agents today.

Implementing Self-Correction

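try:
    exec(code)

except Exception as e:

    # Include the exception type; str(e) alone can be empty or ambiguous.
    error = f"{type(e).__name__}: {e}"

    # Send the error and the failing code back to the model as feedback.
    prompt = f"""
    Fix this Python code.

    Error:
    {error}

    Code:
    {code}
    """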

The model can then produce a corrected version automatically.
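Put together, the reflection loop might look like the sketch below. The `fake_llm` function is a stand-in for a real model call and is purely hypothetical: it returns the misspelled code first and a corrected version once it sees an error in the prompt. The retry limit is also an assumption; some cap is needed so that a model which never fixes the bug cannot loop forever.

```python
import contextlib
import io

def fake_llm(prompt):
    """Hypothetical model stub: buggy on the first try, fixed once shown an error."""
    if "Error:" in prompt:
        return "import math\nprint(math.sqrt(2))"
    return "import math\nprint(math.squrt(2))"

def run(code):
    """Execute code (unsandboxed) and return (succeeded, stdout or error text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as e:
        # Include the exception type so the model sees what kind of failure it was.
        return False, f"{type(e).__name__}: {e}"
    return True, buffer.getvalue().strip()

def solve(question, max_attempts=3):
    prompt = question
    for attempt in range(1, max_attempts + 1):
        code = fake_llm(prompt)
        ok, output = run(code)
        if ok:
            return output, attempt
        # Feed the error and the failing code back as the next prompt.
        prompt = f"Fix this Python code.\n\nError:\n{output}\n\nCode:\n{code}"
    raise RuntimeError(f"No working code after {max_attempts} attempts")

result, attempts = solve("What is the square root of 2?")
print(result, attempts)  # 1.4142135623730951 2
```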

Why Coding Agents Are So Effective

Traditional tool systems are constrained by available tools.

Code execution creates a universal tool.

Mathematically:

$$\text{Capabilities} = \sum_{i=1}^{N} \text{Tools}_i$$

Traditional systems increase capabilities by adding tools.

Code execution changes the equation:

$$\text{Capabilities} \approx \text{Python}$$

Since Python can:

  • manipulate files
  • process images
  • analyze data
  • call APIs
  • perform calculations

the capability space expands dramatically.

The Security Problem

There is an obvious concern.

What if the generated code is harmful?

For example:

import os

os.remove("important_file.py")

Or worse:

import shutil

shutil.rmtree("/")

The model does not inherently understand the consequences of execution.

This creates one of the largest risks in Agentic AI systems.

Real-World Example

Coding agents occasionally generate dangerous commands.

Examples include:

  • deleting project files
  • modifying repositories
  • overwriting configurations
  • removing source code

Even highly capable models can make mistakes.

The lesson highlights a real-world incident where an agent deleted Python files inside a project before realizing the error. Fortunately, the repository was backed up in version control.

Sandboxing: The Correct Approach

Best practice is never to execute arbitrary code directly on production systems.

Instead:

graph TD
    A[Generated Code]
    --> B[Sandbox]

    B --> C[Execute]

    C --> D[Result]

    D --> E[LLM]

The sandbox isolates execution from:

  • sensitive files
  • production databases
  • customer data
  • infrastructure systems
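Short of a full sandbox, one small step in this direction is to run generated code in a separate interpreter process with a timeout and a throwaway working directory. The sketch below does that with the standard library only; `run_in_subprocess` is a hypothetical helper. It stops runaway loops and keeps the code out of the agent's own process, but it is not a real sandbox: the child process can still read and write anything the current user can.

```python
import subprocess
import sys
import tempfile

def run_in_subprocess(code, timeout_seconds=5):
    """Run code in a child Python process; return (exit_code, stdout, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            completed = subprocess.run(
                # -I: isolated mode, ignores environment variables and user site-packages
                [sys.executable, "-I", "-c", code],
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
                cwd=workdir,  # relative paths resolve inside the temp directory
            )
        except subprocess.TimeoutExpired:
            return None, "", f"Timed out after {timeout_seconds} seconds"
    return completed.returncode, completed.stdout, completed.stderr

print(run_in_subprocess("print(2 ** 0.5)"))
print(run_in_subprocess("while True: pass", timeout_seconds=1))
```

A non-zero exit code together with the captured stderr can be fed back to the model in the same way as an exception message.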

Popular Sandboxing Options

1. Docker

docker run --rm python:3.12 python -c "print(2 ** 0.5)"

Provides:

  • filesystem isolation
  • process isolation
  • resource limits
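As a sketch of how an agent might apply these properties, the function below assembles a `docker run` command that feeds generated code to a container over stdin. The specific limits (no network, 256 MB of memory, one CPU, read-only root filesystem) are illustrative assumptions rather than recommendations from the source, and `build_docker_command` and `run_in_docker` are hypothetical helper names. Actually running it requires a local Docker installation and the `python:3.12` image.

```python
import subprocess

def build_docker_command(image="python:3.12"):
    """Return the argv for a locked-down, single-use Python container."""
    return [
        "docker", "run",
        "--rm",               # delete the container when it exits
        "-i",                 # keep stdin open so code can be piped in
        "--network", "none",  # no network access
        "--memory", "256m",   # cap memory
        "--cpus", "1",        # cap CPU
        "--read-only",        # root filesystem is read-only
        image,
        "python", "-",        # "-" tells Python to read the script from stdin
    ]

def run_in_docker(code, timeout_seconds=30):
    """Pipe code into the container and return the completed process."""
    return subprocess.run(
        build_docker_command(),
        input=code,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )

print(" ".join(build_docker_command()))
```

Note that the timeout here only stops the `docker` client process; a production setup would likely also need to stop the container itself.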

2. E2B

Provides lightweight cloud sandboxes specifically designed for AI agents.

Useful for:

  • coding assistants
  • data analysis agents
  • autonomous workflows

The lesson specifically recommends sandboxed environments such as Docker or E2B for safer execution.


Tool Use vs Code Execution

Tool use:

Known Functions

Code execution:

Generate New Functions Dynamically

Comparison:

| Capability | Tool Use | Code Execution |
| --- | --- | --- |
| Fixed actions | ✓ | ✓ |
| Unlimited calculations | ✗ | ✓ |
| Dynamic algorithms | ✗ | ✓ |
| Data analysis | Limited | Excellent |
| Security risk | Medium | High |
| Flexibility | Medium | Very High |

The Bigger Shift

The progression of agent capabilities often looks like this:

graph TD
    A[Prompting]
    --> B[Tool Use]
    --> C[Code Execution]
    --> D[Autonomous Agents]

Tool use gave LLMs access to external systems.

Code execution gives them access to programmable computation.

And that combination is what powers many of today's most capable AI agents.

Why Frontier Models Are Trained for Code Execution

Many modern models receive specialized training for:

  • code generation
  • debugging
  • tool usage
  • reflection loops
  • execution feedback

This is because code execution can substantially improve performance on tasks that require exact computation.

The model no longer has to compute everything mentally.

Instead:

$$\text{Reasoning} + \text{Computation}$$

becomes

$$\text{Reasoning} \rightarrow \text{Code} \rightarrow \text{Execution}$$

which is often significantly more reliable.


Final Thoughts

Code execution is one of the most important capabilities in Agentic AI.

Instead of building hundreds of specialized tools, we allow the model to create its own solution dynamically.

The result is a system capable of:

  • advanced mathematics
  • financial modeling
  • data processing
  • algorithmic reasoning
  • workflow automation

However, with that power comes responsibility.

The best production systems combine:

$$\text{Code Execution} + \text{Reflection} + \text{Sandboxing} + \text{Evaluation}$$

to create agents that are both powerful and safe.

As agentic systems continue to evolve, code execution will likely remain one of the foundational capabilities separating simple chatbots from truly autonomous AI systems.
