Code Execution in Agentic AI

Learn how Agentic AI systems generate, execute, and refine code to solve complex problems, perform calculations, automate workflows, and interact with external systems. Explore execution loops, self-correction, sandboxing, and the role of code execution in building powerful autonomous AI agents.

When LLMs Become Programmers

One of the most powerful capabilities in modern Agentic AI systems is not planning, retrieval, or tool use.

It is code execution.

When developers first learn about tool calling, they often create individual tools for every possible operation:

add()
subtract()
multiply()
divide()

This works initially.

But very quickly the approach breaks down.

What happens when users ask more advanced questions?

What is the square root of 2?

Now we need:

sqrt()
power()
log()
sin()
cos()
tan()

Eventually, we end up recreating an entire scientific calculator as individual tools.

There is a better approach.

Instead of creating thousands of tools:

Let the LLM write code.

Why Code Execution Matters

Traditional tool use looks like:

graph TD
    A[User Question]
    --> B[LLM]

    B --> C[Specific Tool]

    C --> D[Result]

    D --> B

    B --> E[Answer]

Each new capability requires another tool.
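A minimal sketch of the one-tool-per-operation design makes the scaling problem concrete. The `TOOLS` registry and `call_tool` dispatcher are hypothetical names used for illustration, not part of any specific framework:

```python
import math
from typing import Callable

# Hypothetical registry: every supported operation needs its own entry.
TOOLS: dict[str, Callable[..., float]] = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}


def call_tool(name: str, *args: float) -> float:
    """Dispatch a tool call requested by the model."""
    if name not in TOOLS:
        raise KeyError(f"no tool named {name!r}; a developer must add one")
    return TOOLS[name](*args)


print(call_tool("add", 2, 3))  # works: the tool exists

try:
    call_tool("sqrt", 2)  # fails until someone registers a sqrt tool
except KeyError as exc:
    print(exc)

TOOLS["sqrt"] = math.sqrt  # the fix is always "write one more tool"
print(call_tool("sqrt", 2))
```

Every question outside the registry turns into a code change and a redeploy, which is the cost that code execution avoids.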

Code execution changes the architecture.

graph TD
    A[User Question]
    --> B[LLM]

    B --> C[Generate Python]

    C --> D[Execute Code]

    D --> E[Execution Result]

    E --> B

    B --> F[Answer]

The LLM effectively creates its own tools on demand.

Example: Solving Mathematical Problems

Suppose a user asks:

What is the square root of 2?

Instead of calling a predefined tool, the model generates:

import math

print(math.sqrt(2))

The system executes the code and obtains:

1.4142135623730951

The result is returned to the model.

The model then responds:

The square root of 2 is approximately 1.4142.

This enables virtually unlimited computational capabilities without requiring a custom function for every operation.

Building a Code Execution Agent

The basic workflow consists of four stages.

sequenceDiagram
    participant User
    participant LLM
    participant Executor

    User->>LLM: Solve math problem

    LLM->>Executor: Generated Python

    Executor-->>LLM: Result

    LLM-->>User: Final Answer

Prompting the Model

A common pattern is instructing the model to emit executable code.

SYSTEM_PROMPT = """
    Write Python code to solve the problem.
    
    Wrap all code inside:
    
    <execute_python>
    ...
    </execute_python>
"""

User:

What is the square root of 2?

Model:

<execute_python>
    import math
    
    print(math.sqrt(2))
</execute_python>

Extracting the Generated Code

The application extracts the code block.

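import re
import textwrap

# Non-greedy match so only the text between the tags is captured;
# re.DOTALL lets "." span multiple lines.
pattern = r"<execute_python>(.*?)</execute_python>"

match = re.search(pattern, llm_output, re.DOTALL)
if match is None:
    raise ValueError("model output contained no <execute_python> block")

# Strip the common indentation so the snippet is valid top-level Python.
code = textwrap.dedent(match.group(1)).strip()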

The extracted code can then be executed.

Executing Python

The simplest approach uses Python's built-in execution engine.

exec(code)

This runs whatever code the model generated.

However:

It also introduces risk.

The generated code may not always be safe.
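A bare `exec(code)` prints to the application's own stdout, so the result never reaches the model unless it is captured. The following is a minimal sketch of one way to do that with the standard library; `run_and_capture` is a hypothetical helper name, and the fresh namespace is a convenience, not a security boundary:

```python
import contextlib
import io


def run_and_capture(code: str) -> str:
    """Run code with exec() and return everything it printed."""
    buffer = io.StringIO()
    namespace: dict = {}  # fresh globals so the snippet cannot see our variables
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # still unsafe for untrusted code
    return buffer.getvalue()


output = run_and_capture("import math\nprint(math.sqrt(2))")
print(output.strip())
```

The returned string is what would be appended to the conversation as the execution result.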

Beyond Simple Math

Code execution becomes far more interesting for complex calculations.

Consider compound interest.

User:

If I invest $500 monthly for 10 years
at 8% annual return,
what will I have?

Generated code:

r = 0.08 / 12
n = 120
payment = 500

future_value = (
    payment *
    (((1 + r) ** n - 1) / r)
)

print(round(future_value, 2))

The model can solve problems that would otherwise require dozens of specialized financial tools.
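Generated formulas are worth sanity-checking. The sketch below compares the closed-form expression used above with a month-by-month simulation, assuming deposits are made at the end of each month (an ordinary annuity). The function names are illustrative:

```python
def future_value_closed_form(payment: float, annual_rate: float, years: int) -> float:
    """Future value of equal end-of-month deposits (ordinary annuity)."""
    r = annual_rate / 12
    n = years * 12
    return payment * (((1 + r) ** n - 1) / r)


def future_value_simulated(payment: float, annual_rate: float, years: int) -> float:
    """Same quantity, computed by stepping through every month."""
    r = annual_rate / 12
    balance = 0.0
    for _ in range(years * 12):
        balance = balance * (1 + r) + payment  # grow, then deposit at month end
    return balance


closed = future_value_closed_form(500, 0.08, 10)
simulated = future_value_simulated(500, 0.08, 10)
print(round(closed, 2), round(simulated, 2))
```

If the two numbers agree, the formula matches the deposit schedule it is supposed to model. If deposits were made at the start of each month instead, the result would be larger by a factor of `(1 + r)`.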

Reflection Improves Reliability

What happens if the generated code fails?

Example:

import math

print(math.squrt(2))

Output:

AttributeError: module 'math' has no attribute 'squrt'

Rather than stopping, we can implement a reflection loop.

graph TD
    A[Generate Code]
    --> B[Execute]

    B --> C{Success?}

    C -->|No| D[Return Error]

    D --> E[Regenerate Code]

    E --> B

    C -->|Yes| F[Final Answer]

The error message becomes feedback.

This is one of the most common patterns in coding agents today.

Implementing Self-Correction

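import traceback

try:
    exec(code)
except Exception:
    # Keep the full traceback: it names the exception type and the failing
    # line, which str(e) alone would drop.
    error = traceback.format_exc()

    # Feed the error and the failing code back to the model.
    prompt = f"""
    Fix this Python code.

    Error:
    {error}

    Code:
    {code}
    """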

The model can then produce a corrected version automatically.
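Put together, the reflection loop might look like the sketch below. `fake_model` is a hypothetical stand-in for a real LLM call that returns canned code, and the attempt limit is an assumption added here so the loop cannot run forever:

```python
import contextlib
import io
import traceback


def try_run(code: str) -> tuple[bool, str]:
    """Return (True, stdout) on success or (False, traceback text) on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # unsafe outside a sandbox
    except Exception:
        return False, traceback.format_exc()
    return True, buffer.getvalue()


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: 'fixes' the typo once it sees the error."""
    if "squrt" in prompt and "AttributeError" in prompt:
        return "import math\nprint(math.sqrt(2))"
    return "import math\nprint(math.squrt(2))"


def solve_with_reflection(task: str, max_attempts: int = 3) -> str:
    prompt = task
    for _ in range(max_attempts):
        code = fake_model(prompt)
        ok, text = try_run(code)
        if ok:
            return text.strip()
        # Feed the failing code and its traceback back to the model.
        prompt = f"Fix this Python code.\n\nError:\n{text}\n\nCode:\n{code}"
    raise RuntimeError(f"no working code after {max_attempts} attempts")


print(solve_with_reflection("What is the square root of 2?"))
```

Here the first attempt fails with the `AttributeError`, the traceback goes back into the prompt, and the second attempt succeeds.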

Why Coding Agents Are So Effective

Traditional tool systems are constrained by available tools.

Code execution creates a universal tool.

Mathematically:

Capabilities = \sum_{i=1}^{N} Tools_i

Traditional systems increase capabilities by adding tools.

Code execution changes the equation:

Capabilities \approx Python

Since Python can:

manipulate files
process images
analyze data
call APIs
perform calculations

the capability space expands dramatically.

The Security Problem

There is an obvious concern.

What if the generated code is harmful?

For example:

import os

os.remove("important_file.py")

Or worse:

import shutil

shutil.rmtree("/")

The model does not inherently understand the consequences of execution.

This creates one of the largest risks in Agentic AI systems.

Real-World Example

Coding agents occasionally generate dangerous commands.

Examples include:

deleting project files
modifying repositories
overwriting configurations
removing source code

Even highly capable models can make mistakes.

The lesson highlights a real-world incident where an agent deleted Python files inside a project before realizing the error. Fortunately, the repository was backed up in version control.

Sandboxing: The Correct Approach

Best practice is never to execute arbitrary code directly on production systems.

Instead:

graph TD
    A[Generated Code]
    --> B[Sandbox]

    B --> C[Execute]

    C --> D[Result]

    D --> E[LLM]

The sandbox isolates execution from:

sensitive files
production databases
customer data
infrastructure systems
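As a first step away from in-process `exec`, generated code can at least run in a separate process with a timeout and a throwaway working directory. This is a sketch under stated assumptions (`run_in_subprocess` is a hypothetical helper), and it is not a sandbox: the child process can still reach the filesystem and the network with the same permissions as the parent.

```python
import subprocess
import sys
import tempfile


def run_in_subprocess(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run code in a separate Python process inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I: isolated mode, ignores PYTHON* env vars and user site-packages
            [sys.executable, "-I", "-c", code],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on runaway code
        )


result = run_in_subprocess("print(2 ** 10)")
print(result.stdout.strip(), result.returncode)
```

A crash or infinite loop in the generated code now cannot take down the agent itself, but real isolation still needs a container or a dedicated sandbox service.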

Popular Sandboxing Options

1. Docker

Run the generated code inside a local Docker container.

Provides:

filesystem isolation
process isolation
resource limits

Example:

docker run --rm python:3.12 python -c "print(2 ** 10)"
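An agent would typically assemble the `docker run` command programmatically. The sketch below builds one with a few standard isolation flags; the helper name, the `/sandbox/script.py` mount path, the `generated.py` filename, and the specific limits are all assumptions chosen for illustration:

```python
import shlex
from pathlib import Path


def build_docker_command(script: Path, image: str = "python:3.12") -> list[str]:
    """Build a `docker run` invocation with basic isolation flags."""
    return [
        "docker", "run",
        "--rm",                 # delete the container when it exits
        "--network", "none",    # no network access
        "--memory", "256m",     # cap RAM
        "--cpus", "1",          # cap CPU
        "--read-only",          # read-only root filesystem
        "-v", f"{script.resolve()}:/sandbox/script.py:ro",  # mount the script read-only
        image,
        "python", "/sandbox/script.py",
    ]


command = build_docker_command(Path("generated.py"))
print(shlex.join(command))
# To actually run it (requires Docker):
# subprocess.run(command, capture_output=True, text=True, timeout=30)
```

Building the command as a list rather than a shell string avoids quoting problems when the script path contains spaces.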

2. E2B

E2B is open-source infrastructure for running AI-generated code in secure, isolated cloud sandboxes. The sandboxes are lightweight and designed specifically for AI agents.

Useful for:

Coding assistants
Data analysis agents
Autonomous workflows

Tool Use vs Code Execution

Tool use:

Known Functions

Code execution:

Generate New Functions Dynamically

Comparison:

| Capability | Tool Use | Code Execution |
| --- | --- | --- |
| Fixed actions | ✓ | ✓ |
| Unlimited calculations | ✗ | ✓ |
| Dynamic algorithms | ✗ | ✓ |
| Data analysis | Limited | Excellent |
| Security risk | Medium | High |
| Flexibility | Medium | Very High |

The Bigger Shift

The progression of agent capabilities often looks like this:

graph TD
    A[Prompting]
    --> B[Tool Use]
    --> C[Code Execution]
    --> D[Autonomous Agents]

Tool use gave LLMs access to external systems.

Code execution gives them access to programmable computation.

And that combination is what powers many of today's most capable AI agents.

Why Frontier Models Are Trained for Code Execution

Many modern models receive specialized training for:

code generation
debugging
tool usage
reflection loops
execution feedback

This is because offloading exact computation to an interpreter tends to improve performance on reasoning tasks that involve arithmetic or data manipulation.

The model no longer has to compute everything mentally.

Instead:

Reasoning + Computation

becomes

Reasoning \rightarrow Code \rightarrow Execution

which is often significantly more reliable.

Final Thoughts

Code execution is one of the most important capabilities in Agentic AI.

Instead of building hundreds of specialized tools, we allow the model to create its own solution dynamically.

The result is a system capable of:

Advanced mathematics
Financial modeling
Data processing
Algorithmic reasoning
Workflow automation

However, with that power comes responsibility.

The best production systems combine:

Code\ Execution + Reflection + Sandboxing + Evaluation

to create agents that are both powerful and safe.

As agentic systems continue to evolve, code execution will likely remain one of the foundational capabilities separating simple chatbots from truly autonomous AI systems.

Written by Hitesh Sahu, a passionate developer and blogger.

Sun May 31 2026

