Using LLMs in Development
Practical examples of how large language models are integrated into real production systems, from support automation and knowledge retrieval to developer tooling, code generation, and intelligent assistants.
Using LLMs in Software Applications
Large Language Models are not just research tools anymore.
They are increasingly becoming components inside real software systems.
Instead of building complex ML pipelines from scratch, engineers can now integrate an LLM using a prompt and an API call.
In this article we explore how LLMs are used in production software, based on ideas from Andrew Ng's Generative AI for Everyone – Week 2.
Classic ML vs Generative AI Workflow
The workflow difference is significant:
- Supervised learning
  - Get labeled data
  - Train AI model
  - Deploy model
  - Can take months
- Prompt-based AI
  - Specify prompt
  - Deploy model
  - Can take minutes, hours, or days
```mermaid
flowchart TD
    A[Supervised Learning] --> A1[Get labeled data]
    A1 --> A2[Train AI model on data]
    A2 --> A3[Deploy and run model]
    B[Prompt-Based AI] --> B1[Specify prompt]
    B1 --> B2[Deploy and run model]
```
This is one of the biggest reasons LLMs are attractive in product development: they dramatically reduce time to first prototype.
Before LLMs, a common way to build a text application was supervised learning. For example, if a restaurant wanted to monitor online reviews, the team would:

- Collect labeled examples
- Train an AI model
- Deploy the model

Example:

Input: Restaurant reviews → Output: Sentiment (Positive / Negative)

The system learns a mapping $f: x \to y$ from input text $x$ to output label $y$. For sentiment classification:

$$f(x) = y \in \{\text{Positive}, \text{Negative}\}$$

where:

- $x$ is the review text
- $y$ is the predicted sentiment label

For example, $f(\text{"The banana pudding was really tasty!"}) = \text{Positive}$.

This approach works, but it is often slow: dataset creation and model training can take months.

Generative AI changes this dramatically.
Development time often drops from months to hours or days.
Prompt-Based Development
Instead of training a classifier, we can simply write a prompt.
Example:
```python
# llm_response is a placeholder for a call to an LLM API
# (e.g. an SDK's chat-completion function).
prompt = """
Classify the following review
as either positive or negative:

The banana pudding was really tasty!
"""

response = llm_response(prompt)
print(response)
```
Expected output: Positive
This works because large language models already have general knowledge learned during pretraining.
Real Software Applications of LLMs
LLMs can power many types of applications.
Writing Applications
Examples:
- drafting emails
- generating reports
- marketing copy
- summarizing documents
Architecture:
User → Prompt → LLM → Generated Text
Reading Applications
LLMs can understand and extract information from text.
Example tasks:
- summarization
- information extraction
- sentiment analysis
- document classification
Example prompt:

Classify the sentiment of the following review:
"The mochi is excellent!"

Output: Positive
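Beyond sentiment, a reading task can pull structured fields out of free text. A sketch in the same style, building only the prompt (the field names and the review text are invented for the example; sending the prompt would require an LLM API):

```python
# Sketch of a reading-style task: asking the model to extract structured
# fields from free text. Only the prompt is constructed here.
review = "The mochi is excellent, but delivery took 90 minutes."

prompt = (
    "Extract the following fields from the review and answer as JSON "
    "with keys dish, sentiment, delivery_minutes:\n\n"
    f"Review: {review}"
)
print(prompt)
```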
Chat Applications
LLMs also power conversational systems.
Example interaction:
User: I'd like a cheeseburger for delivery
Bot: Sure. Anything else?
User: That's all
Bot: It will arrive in 20 minutes
These systems combine:
- prompts
- conversation memory
- business logic
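A minimal sketch of how those pieces fit together, with `llm_response` stubbed out (in a real system it would call an LLM API, and business logic would validate the order) and conversation memory kept as a simple list of turns:

```python
# Minimal chat-application sketch: a prompt template plus conversation memory.

def llm_response(prompt):
    # Stub standing in for a real LLM API call.
    return "Sure. Anything else?"

def build_prompt(history, user_message):
    """Combine system instructions, past turns, and the new message."""
    lines = ["You are a food-ordering assistant."]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Bot:")
    return "\n".join(lines)

history = []
user_message = "I'd like a cheeseburger for delivery"
prompt = build_prompt(history, user_message)
reply = llm_response(prompt)
history += [("User", user_message), ("Bot", reply)]
print(reply)
```

Each turn is appended to `history`, so later prompts carry the full conversation.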
Lifecycle of a Generative AI Project
Building an AI system is an iterative engineering process.
Typical lifecycle:
- Scope project
- Build or improve system
- Internal evaluation
- Deploy and monitor
Lifecycle Diagram
```mermaid
flowchart LR
    S[Scope project] --> B[Build or improve system]
    B --> E[Internal evaluation]
    E --> D[Deploy and monitor]
    D --> B
```
A prototype may look good on a simple example, but fail on a slightly different one. For instance, a sentiment model may correctly label:
“The custard tart was amazing!” → Positive

but incorrectly label:

“My pasta was cold” → Positive
This shows why evaluation is essential. A working demo is not the same thing as a reliable product.
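A tiny evaluation harness makes this concrete: run the classifier over a small labeled set and measure accuracy. `classify` here is a deliberately naive stub standing in for the LLM call, chosen so it reproduces the failure above:

```python
# Internal-evaluation sketch: score the system on a small labeled set.

def classify(review):
    # Naive stub standing in for the LLM-based classifier;
    # deliberately weak so the evaluation surfaces a failure case.
    return "Negative" if "terrible" in review.lower() else "Positive"

labeled_examples = [
    ("The custard tart was amazing!", "Positive"),
    ("My pasta was cold", "Negative"),
]

def accuracy(examples):
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)

print(accuracy(labeled_examples))  # → 0.5: right on one example, wrong on the other
```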
This loop is central to real LLM engineering. You ship a prototype, observe failure cases, improve prompts or architecture, and repeat.
Engineers must analyze such failures and improve the system.
Improving LLM Performance
Building AI systems is highly empirical.
We improve performance through experimentation.
Common techniques include:
1. Prompting
Prompting is usually the first and cheapest lever.
You change the instructions, add examples, clarify format, or provide constraints.
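For instance, the earlier one-line sentiment instruction can be upgraded to a few-shot prompt with an explicit output constraint (the example reviews below are invented):

```python
# The "prompting" lever: same task, upgraded from a bare instruction to a
# few-shot prompt with an explicit output format.

def build_fewshot_prompt(review):
    return (
        "Classify each review as Positive or Negative.\n"
        "Answer with exactly one word.\n\n"
        "Review: The banana pudding was really tasty!\n"
        "Sentiment: Positive\n\n"
        "Review: Service was slow and the soup was cold.\n"
        "Sentiment: Negative\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

prompt = build_fewshot_prompt("The mochi is excellent!")
print(prompt)
```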
2. Retrieval Augmented Generation (RAG)
RAG gives the LLM access to external data sources so it can answer questions using organization-specific information rather than relying only on its built-in knowledge.
3. Fine-tuning
Fine-tuning adapts a model to your task, style, or domain.
4. Pretraining
Pretraining means training an LLM from scratch.
This is the most expensive and hardest option, and usually the last resort.
Improvement Loop Diagram
```mermaid
flowchart LR
    I[Idea] --> P[Prompt]
    P --> R[LLM response]
    R --> I
```
Cost Intuition
Estimate LLM cost using tokens. Roughly, a token is about 3/4 of a word, so $W$ words correspond to about $\tfrac{4}{3}W$ tokens. (The price below is illustrative; actual rates vary by model and provider.)

If a person reads about 250 words per minute, then in one hour they consume about:

$$250 \times 60 = 15{,}000 \text{ words}$$

If the system also processes a similar amount of prompt text, total words might be around:

$$15{,}000 \times 2 = 30{,}000 \text{ words}$$

Converting words to tokens:

$$30{,}000 \times \tfrac{4}{3} \approx 40{,}000 \text{ tokens}$$

If cost is about \$0.002 per 1,000 tokens, then the total estimated cost is:

$$\frac{40{,}000}{1{,}000} \times \$0.002 = \$0.08$$

So even an hour of heavy, reading-speed usage costs only a few cents.
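The same arithmetic as a sketch (the 250 words/minute reading speed, the 4/3 tokens-per-word ratio, and the \$0.002 per 1,000 tokens price are illustrative assumptions):

```python
# Back-of-the-envelope LLM cost for one hour of reading-speed output.

words_read_per_hour = 250 * 60          # 15,000 words of output
total_words = words_read_per_hour * 2   # assume a similar volume of prompt text
tokens = total_words * 4 / 3            # ~40,000 tokens
cost = tokens / 1000 * 0.002            # price per 1,000 tokens (illustrative)

print(round(cost, 2))  # → 0.08
```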
Retrieval Augmented Generation (RAG)
How RAG Works
The slides break RAG into three steps:
- Search relevant documents for an answer
- Insert retrieved text into the prompt
- Generate the answer from the updated prompt
Mermaid RAG Flow
```mermaid
flowchart TD
    Q[User question] --> R1[Retrieve relevant documents]
    R1 --> R2[Insert retrieved context into prompt]
    R2 --> LLM[LLM generates answer]
    LLM --> A[Grounded response]
```
Give the model access to external knowledge. This allows models to answer questions about private or up-to-date data.

Conceptually, the prompt becomes:

"Use the following context to answer the question at the end: {retrieved documents} {user question}"

For example, a question about a company's parking policy can be answered from a retrieved HR document rather than from the model's built-in knowledge.

This is powerful because the LLM is being used more as a reasoning engine than as a pure source of facts. It reads relevant text and uses that text to formulate an answer.
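A minimal sketch of the three steps, using naive keyword-overlap retrieval over an in-memory document list (the documents are invented; real systems use embeddings and a vector store):

```python
# Minimal RAG sketch: retrieve, insert into prompt, hand off to the LLM.

documents = [
    "Parking policy: employees may park in lot B on weekdays.",
    "Vacation policy: full-time employees accrue 15 days per year.",
]

def retrieve(question, docs, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt("Is there a parking policy?", documents)
print(prompt)
```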
Fine-Tuning
Fine-tuning adapts a model to a specific task.
Pretraining: Train on massive internet text
Fine-tuning: Adapt model using smaller domain dataset
Typical scale:
| Stage | Data Size |
|---|---|
| Pretraining | billions of tokens |
| Fine-tuning | thousands of examples |
Use cases:
- domain-specific language
- structured outputs
- company-specific style
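Fine-tuning data is often prepared as JSON lines of prompt/completion pairs. A sketch with two invented examples (real datasets hold thousands, and the exact schema depends on the provider's fine-tuning API):

```python
# Sketch of preparing a fine-tuning dataset as JSON lines.
import json

examples = [
    {"prompt": "Classify: The banana pudding was really tasty!",
     "completion": "Positive"},
    {"prompt": "Classify: My pasta was cold",
     "completion": "Negative"},
]

# One JSON object per line, the common fine-tuning file format.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```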
When Should You Pretrain a Model?
Pretraining an LLM is extremely expensive.
Typical requirements:
- hundreds of billions of tokens
- months of training
- tens of millions of dollars
For most application teams, pretraining should be an option of last resort. It only makes sense when the domain is highly specialized and existing models cannot be adapted effectively.
Decision Ladder
```mermaid
flowchart TD
    A[Start with prompting] --> B{Good enough?}
    B -- Yes --> Z[Deploy]
    B -- No --> C[Try RAG]
    C --> D{Good enough?}
    D -- Yes --> Z
    D -- No --> E[Try fine-tuning]
    E --> F{Good enough?}
    F -- Yes --> Z
    F -- No --> G[Consider pretraining as last resort]
```
Most applications use:
- prompting
- RAG
- fine-tuning
Choosing the Right Model Size
Different tasks require different model sizes.
| Model Size | Capabilities |
|---|---|
| 1B parameters | basic tasks |
| 10B parameters | moderate reasoning |
| 100B+ parameters | complex reasoning |
Example mapping:
| Task | Model Size |
|---|---|
| Sentiment classification | small |
| Chatbot | medium |
| Brainstorming assistant | large |
Closed vs Open Source Models
There are two major deployment strategies.
Closed Models
Examples:
- OpenAI
- Anthropic
Advantages:
- strong performance
- easy API integration
Disadvantages:
- vendor lock-in
- data privacy concerns
Open Source Models
Examples:
- LLaMA
- Mistral
- Falcon
Advantages:
- full control
- on-prem deployment
- better privacy
Disadvantages:
- infrastructure complexity
- weaker models (sometimes)
Tool Use with LLMs
LLMs can also call external tools.
Example:
User question: How much money will I have after 8 years if I deposit $100 at 5% interest?
Result: $100 × (1.05)^8 ≈ $147.74
Tool usage makes systems more reliable.
RLHF
RLHF trains a reward model that scores answers. Higher scores go to responses that are more helpful, honest, and harmless.
We can describe the reward idea as a learned function:

$$R(x, y) \in \mathbb{R}$$

where $x$ is the prompt and $y$ a candidate response. Then the model is optimized to produce responses with higher expected reward:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[ R(x, y) \right]$$

where $\pi$ is the model's response policy.
RLHF Flow Diagram
```mermaid
flowchart TD
    P[Prompt] --> G[Model generates candidate responses]
    G --> H[Humans score responses]
    H --> RM[Train reward model]
    RM --> FT[Further train model to prefer high-reward responses]
```
This is one reason chat systems feel more aligned, polite, and useful than raw base models.
Tool Use
LLMs are powerful, but they are not reliable at everything. In particular, they often struggle with precise arithmetic or actions that require external systems.
The course shows a food-ordering example. A user says:
Send me a burger!
A naive chatbot may simply reply:
Ok, it’s on the way!
But that is not enough. A real system must gather order details, confirm the delivery address, and call the ordering backend.
Tool-Based Ordering Flow
```mermaid
flowchart TD
    U[User message] --> L[LLM interprets request]
    L --> T[Call ordering tool]
    T --> C[Show confirmation to user]
    C --> Y{User confirms?}
    Y -- Yes --> O[Place order]
    Y -- No --> X[Cancel or revise]
```
The tool call might conceptually look like:
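One hypothetical shape for such a call, sketched as structured data plus a dispatcher (the tool name, argument names, and dispatch logic are all invented for illustration; real systems use their provider's function-calling format):

```python
# Hypothetical structured tool call: the LLM emits a request and the
# application routes it to the ordering backend.

tool_call = {
    "tool": "place_order",
    "arguments": {"item": "burger", "delivery": True},
}

def dispatch(call):
    """Route a structured tool call to the matching backend function (stubbed)."""
    if call["tool"] == "place_order":
        return f"Order placed: {call['arguments']['item']}"
    raise ValueError(f"Unknown tool: {call['tool']}")

print(dispatch(tool_call))  # → Order placed: burger
```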
This makes the LLM part of a larger application architecture rather than the whole system.
Tools for Reasoning
The slides also show that LLMs are not always good at exact math.
Question:
How much would I have after 8 years if I deposit $100 at 5% interest?
A model may produce the wrong number if it tries to reason directly in text. The more reliable method is tool use: the LLM should call an external calculator.
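A sketch of that calculator call. Note the exact value is 147.7455…, so depending on rounding it appears as 147.75 or, truncated, as the 147.74 the slides show:

```python
# The computation the LLM should delegate to a calculator tool:
# compound interest on $100 at 5% annual interest for 8 years.

def compound(principal, rate, years):
    return principal * (1 + rate) ** years

amount = compound(100, 0.05, 8)
print(f"{amount:.2f}")  # → 147.75
```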
Math Tool Flow
```mermaid
flowchart TD
    Q[User asks math question] --> LLM[LLM recognizes need for precise calculation]
    LLM --> Calc[External calculator]
    Calc --> Result[147.74]
    Result --> Answer[LLM returns grounded answer]
```
This is an important engineering lesson: do not force the LLM to do tasks that a specialized tool can do more reliably.
Agents
Agents use LLMs to perform multi-step reasoning and actions.
Example task:
Research BetterBurgers competitors
Agent plan:
- Search competitors
- Visit websites
- Summarize each company
Agent Workflow Diagram
```mermaid
flowchart TD
    U[User goal] --> P[LLM plans steps]
    P --> S[Search]
    S --> V[Visit websites]
    V --> R[Read content]
    R --> M[Summarize findings]
    M --> O[Return final answer]
```
Agents are still an active research area, but the core idea is already useful: combine reasoning, planning, and tools to solve multi-step tasks.
The LLM acts as a controller that decides which tools to use.
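A toy version of that controller loop, under strong assumptions: the plan is hard-coded and the search/visit/summarize tools are stubs. A real agent would let the LLM choose tools and decide when to stop:

```python
# Toy agent loop: plan -> search -> visit -> summarize, with stubbed tools.

def search(query):
    return ["competitor-a.example", "competitor-b.example"]  # stub search results

def visit(url):
    return f"Homepage text of {url}"                         # stub page fetch

def summarize(text):
    return text[:30] + "..."                                 # stub LLM summary

def run_agent(goal):
    findings = []
    for url in search(goal):              # step 1: search competitors
        page = visit(url)                 # step 2: visit website
        findings.append(summarize(page))  # step 3: summarize content
    return findings

results = run_agent("Research BetterBurgers competitors")
print(results)
```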
Key Insight
The most important shift is this:
- LLMs are not just knowledge sources.
- They are reasoning engines that process information.
Instead of asking "What does the model know?", we ask "What information can we give the model so it can reason about it?"
