Using LLMs in Development
Practical examples of how large language models are integrated into real production systems, from support automation and knowledge retrieval to developer tooling, code generation, and intelligent assistants.
Using LLMs in Software Applications
Large Language Models are not just research tools anymore.
They are increasingly becoming components inside real software systems.
Instead of building complex ML pipelines from scratch, engineers can now integrate an LLM using a prompt and an API call.
In this article we explore how LLMs are used in production software, based on ideas from Andrew Ng's Generative AI for Everyone – Week 2.
Classic ML vs Generative AI Workflow
The workflow difference is significant:
- Supervised learning
  - Get labeled data
  - Train AI model
  - Deploy model
  - Can take months
- Prompt-based AI
  - Specify prompt
  - Deploy model
  - Can take minutes, hours, or days
```mermaid
flowchart TD
    A[Supervised Learning] --> A1[Get labeled data]
    A1 --> A2[Train AI model on data]
    A2 --> A3[Deploy and run model]
    B[Prompt-Based AI] --> B1[Specify prompt]
    B1 --> B2[Deploy and run model]
```
This is one of the biggest reasons LLMs are attractive in product development: they dramatically reduce time to first prototype.
Before LLMs, a common way to build a text application was supervised learning. For example, if a restaurant wanted to monitor online reviews, the team would:

- Collect labeled examples
- Train an AI model
- Deploy the model

Example:

Input: Restaurant reviews → Output: Sentiment (Positive / Negative)

The system learns a mapping $f: x \to y$ from input text $x$ to output label $y$. For sentiment classification:

$$f(x) = y \in \{\text{Positive}, \text{Negative}\}$$

where:

- $x$ is the review text
- $y$ is the predicted sentiment label

For example, $f(\text{"The banana pudding was really tasty!"}) = \text{Positive}$.

This approach works, but it is often slow: dataset creation and model training can take months.

Generative AI changes this dramatically.
Development time often drops from months to hours or days.
Prompt-Based Development
Instead of training a classifier, we can simply write a prompt.
Example:
```python
# llm_response is a placeholder for a call to an LLM API
# (e.g. an SDK's chat-completion function).
prompt = """
Classify the following review
as either positive or negative:

The banana pudding was really tasty!
"""

response = llm_response(prompt)
print(response)
```
Expected output: Positive
This works because large language models already have general knowledge learned during pretraining.
Real Software Applications of LLMs
LLMs can power many types of applications.
Writing Applications
Examples:
- drafting emails
- generating reports
- marketing copy
- summarizing documents
Architecture:
User → Prompt → LLM → Generated Text
Reading Applications
LLMs can understand and extract information from text.
Example tasks:
- summarization
- information extraction
- sentiment analysis
- document classification
Example prompt:

Classify the sentiment of the following review:
"The mochi is excellent!"

Output: Positive
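Beyond sentiment, a reading task can pull structured fields out of free text. A sketch in the same style, building only the prompt (the field names and the review text are invented for the example; sending the prompt would require an LLM API):

```python
# Sketch of a reading-style task: asking the model to extract structured
# fields from free text. Only the prompt is constructed here.
review = "The mochi is excellent, but delivery took 90 minutes."

prompt = (
    "Extract the following fields from the review and answer as JSON "
    "with keys dish, sentiment, delivery_minutes:\n\n"
    f"Review: {review}"
)
print(prompt)
```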
Chat Applications
LLMs also power conversational systems.
Example interaction:
User: I'd like a cheeseburger for delivery
Bot: Sure. Anything else?
User: That's all
Bot: It will arrive in 20 minutes
These systems combine:
- prompts
- conversation memory
- business logic
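A minimal sketch of how those pieces fit together, with `llm_response` stubbed out (in a real system it would call an LLM API, and business logic would validate the order) and conversation memory kept as a simple list of turns:

```python
# Minimal chat-application sketch: a prompt template plus conversation memory.

def llm_response(prompt):
    # Stub standing in for a real LLM API call.
    return "Sure. Anything else?"

def build_prompt(history, user_message):
    """Combine system instructions, past turns, and the new message."""
    lines = ["You are a food-ordering assistant."]
    for role, text in history:
        lines.append(f"{role}: {text}")
    lines.append(f"User: {user_message}")
    lines.append("Bot:")
    return "\n".join(lines)

history = []
user_message = "I'd like a cheeseburger for delivery"
prompt = build_prompt(history, user_message)
reply = llm_response(prompt)
history += [("User", user_message), ("Bot", reply)]
print(reply)
```

Each turn is appended to `history`, so later prompts carry the full conversation.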
Lifecycle of a Generative AI Project
Building an AI system is an iterative engineering process.
Typical lifecycle:
- Scope project
- Build or improve system
- Internal evaluation
- Deploy and monitor
Lifecycle Diagram
```mermaid
flowchart LR
    S[Scope project] --> B[Build or improve system]
    B --> E[Internal evaluation]
    E --> D[Deploy and monitor]
    D --> B
```
A prototype may look good on a simple example, but fail on a slightly different one. For instance, a sentiment model may correctly label:
“The custard tart was amazing!” → Positive

but incorrectly label:

“My pasta was cold” → Positive
This shows why evaluation is essential. A working demo is not the same thing as a reliable product.
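A tiny evaluation harness makes this concrete: run the classifier over a small labeled set and measure accuracy. `classify` here is a deliberately naive stub standing in for the LLM call, chosen so it reproduces the failure above:

```python
# Internal-evaluation sketch: score the system on a small labeled set.

def classify(review):
    # Naive stub standing in for the LLM-based classifier;
    # deliberately weak so the evaluation surfaces a failure case.
    return "Negative" if "terrible" in review.lower() else "Positive"

labeled_examples = [
    ("The custard tart was amazing!", "Positive"),
    ("My pasta was cold", "Negative"),
]

def accuracy(examples):
    correct = sum(classify(text) == label for text, label in examples)
    return correct / len(examples)

print(accuracy(labeled_examples))  # → 0.5: right on one example, wrong on the other
```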
This loop is central to real LLM engineering. You ship a prototype, observe failure cases, improve prompts or architecture, and repeat.
Engineers must analyze such failures and improve the system.
Improving LLM Performance
Building AI systems is highly empirical.
We improve performance through experimentation.
Common techniques include:
1. Prompting
Prompting is usually the first and cheapest lever.
You change the instructions, add examples, clarify format, or provide constraints.
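For instance, the earlier one-line sentiment instruction can be upgraded to a few-shot prompt with an explicit output constraint (the example reviews below are invented):

```python
# The "prompting" lever: same task, upgraded from a bare instruction to a
# few-shot prompt with an explicit output format.

def build_fewshot_prompt(review):
    return (
        "Classify each review as Positive or Negative.\n"
        "Answer with exactly one word.\n\n"
        "Review: The banana pudding was really tasty!\n"
        "Sentiment: Positive\n\n"
        "Review: Service was slow and the soup was cold.\n"
        "Sentiment: Negative\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

prompt = build_fewshot_prompt("The mochi is excellent!")
print(prompt)
```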
2. Retrieval Augmented Generation (RAG)
RAG gives the LLM access to external data sources so it can answer questions using organization-specific information rather than relying only on its built-in knowledge.
3. Fine-tuning
Fine-tuning adapts a model to your task, style, or domain.
4. Pretraining
Pretraining means training an LLM from scratch.
This is the most expensive and hardest option, and usually the last resort.
Improvement Loop Diagram
```mermaid
flowchart LR
    I[Idea] --> P[Prompt]
    P --> R[LLM response]
    R --> I
```
Cost Intuition
Estimate LLM cost using tokens. Roughly, a token is about 3/4 of a word, so $W$ words correspond to about $\tfrac{4}{3}W$ tokens. (The price below is illustrative; actual rates vary by model and provider.)

If a person reads about 250 words per minute, then in one hour they consume about:

$$250 \times 60 = 15{,}000 \text{ words}$$

If the system also processes a similar amount of prompt text, total words might be around:

$$15{,}000 \times 2 = 30{,}000 \text{ words}$$

Converting words to tokens:

$$30{,}000 \times \tfrac{4}{3} \approx 40{,}000 \text{ tokens}$$

If cost is about \$0.002 per 1,000 tokens, then the total estimated cost is:

$$\frac{40{,}000}{1{,}000} \times \$0.002 = \$0.08$$

So even an hour of heavy, reading-speed usage costs only a few cents.
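The same arithmetic as a sketch (the 250 words/minute reading speed, the 4/3 tokens-per-word ratio, and the \$0.002 per 1,000 tokens price are illustrative assumptions):

```python
# Back-of-the-envelope LLM cost for one hour of reading-speed output.

words_read_per_hour = 250 * 60          # 15,000 words of output
total_words = words_read_per_hour * 2   # assume a similar volume of prompt text
tokens = total_words * 4 / 3            # ~40,000 tokens
cost = tokens / 1000 * 0.002            # price per 1,000 tokens (illustrative)

print(round(cost, 2))  # → 0.08
```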
Retrieval Augmented Generation (RAG)
How RAG Works
The slides break RAG into three steps:
- Search relevant documents for an answer
- Insert retrieved text into the prompt
- Generate the answer from the updated prompt
Mermaid RAG Flow
```mermaid
flowchart TD
    Q[User question] --> R1[Retrieve relevant documents]
    R1 --> R2[Insert retrieved context into prompt]
    R2 --> LLM[LLM generates answer]
    LLM --> A[Grounded response]
```
Give the model access to external knowledge. This allows models to answer questions about private or up-to-date data.

Conceptually, the prompt becomes:

"Use the following context to answer the question at the end: {retrieved documents} {user question}"

For example, a question about a company's parking policy can be answered from a retrieved HR document rather than from the model's built-in knowledge.

This is powerful because the LLM is being used more as a reasoning engine than as a pure source of facts. It reads relevant text and uses that text to formulate an answer.
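A minimal sketch of the three steps, using naive keyword-overlap retrieval over an in-memory document list (the documents are invented; real systems use embeddings and a vector store):

```python
# Minimal RAG sketch: retrieve, insert into prompt, hand off to the LLM.

documents = [
    "Parking policy: employees may park in lot B on weekdays.",
    "Vacation policy: full-time employees accrue 15 days per year.",
]

def retrieve(question, docs, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt("Is there a parking policy?", documents)
print(prompt)
```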
Fine-Tuning
Fine-tuning adapts a model to a specific task.
Pretraining: Train on massive internet text
Fine-tuning: Adapt model using smaller domain dataset
Typical scale:
| Stage | Data Size |
|---|---|
| Pretraining | billions of tokens |
| Fine-tuning | thousands of examples |
Use cases:
- domain-specific language
- structured outputs
- company-specific style
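Fine-tuning data is often prepared as JSON lines of prompt/completion pairs. A sketch with two invented examples (real datasets hold thousands, and the exact schema depends on the provider's fine-tuning API):

```python
# Sketch of preparing a fine-tuning dataset as JSON lines.
import json

examples = [
    {"prompt": "Classify: The banana pudding was really tasty!",
     "completion": "Positive"},
    {"prompt": "Classify: My pasta was cold",
     "completion": "Negative"},
]

# One JSON object per line, the common fine-tuning file format.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```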
When Should You Pretrain a Model?
Pretraining an LLM is extremely expensive.
Typical requirements:
- hundreds of billions of tokens
- months of training
- tens of millions of dollars
For most application teams, pretraining should be an option of last resort. It only makes sense when the domain is highly specialized and existing models cannot be adapted effectively.
Decision Ladder
```mermaid
flowchart TD
    A[Start with prompting] --> B{Good enough?}
    B -- Yes --> Z[Deploy]
    B -- No --> C[Try RAG]
    C --> D{Good enough?}
    D -- Yes --> Z
    D -- No --> E[Try fine-tuning]
    E --> F{Good enough?}
    F -- Yes --> Z
    F -- No --> G[Consider pretraining as last resort]
```
Most applications use:
- prompting
- RAG
- fine-tuning
Choosing the Right Model Size
Different tasks require different model sizes.
| Model Size | Capabilities |
|---|---|
| 1B parameters | basic tasks |
| 10B parameters | moderate reasoning |
| 100B+ parameters | complex reasoning |
Example mapping:
| Task | Model Size |
|---|---|
| Sentiment classification | small |
| Chatbot | medium |
| Brainstorming assistant | large |
Closed vs Open Source Models
There are two major deployment strategies.
Closed Models
Examples:
- OpenAI
- Anthropic
Advantages:
- strong performance
- easy API integration
Disadvantages:
- vendor lock-in
- data privacy concerns
Open Source Models
Examples:
- LLaMA
- Mistral
- Falcon
Advantages:
- full control
- on-prem deployment
- better privacy
Disadvantages:
- infrastructure complexity
- weaker models (sometimes)
Tool Use with LLMs
LLMs can also call external tools.
Example:
User question: How much money will I have after 8 years if I deposit $100 at 5% interest?
Result: $100 × (1.05)^8 ≈ $147.74
Tool usage makes systems more reliable.
RLHF
RLHF trains a reward model that scores answers. Higher scores go to responses that are more helpful, honest, and harmless.
We can describe the reward idea as a learned function:

$$R(x, y) \in \mathbb{R}$$

where $x$ is the prompt and $y$ a candidate response. Then the model is optimized to produce responses with higher expected reward:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[ R(x, y) \right]$$

where $\pi$ is the model's response policy.
RLHF Flow Diagram
```mermaid
flowchart TD
    P[Prompt] --> G[Model generates candidate responses]
    G --> H[Humans score responses]
    H --> RM[Train reward model]
    RM --> FT[Further train model to prefer high-reward responses]
```
This is one reason chat systems feel more aligned, polite, and useful than raw base models.
Tool Use
LLMs are powerful, but they are not reliable at everything. In particular, they often struggle with precise arithmetic or actions that require external systems.
The course shows a food-ordering example. A user says:
Send me a burger!
A naive chatbot may simply reply:
Ok, it’s on the way!
But that is not enough. A real system must gather order details, confirm the delivery address, and call the ordering backend.
Tool-Based Ordering Flow
```mermaid
flowchart TD
    U[User message] --> L[LLM interprets request]
    L --> T[Call ordering tool]
    T --> C[Show confirmation to user]
    C --> Y{User confirms?}
    Y -- Yes --> O[Place order]
    Y -- No --> X[Cancel or revise]
```
The tool call might conceptually look like:
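One hypothetical shape for such a call, sketched as structured data plus a dispatcher (the tool name, argument names, and dispatch logic are all invented for illustration; real systems use their provider's function-calling format):

```python
# Hypothetical structured tool call: the LLM emits a request and the
# application routes it to the ordering backend.

tool_call = {
    "tool": "place_order",
    "arguments": {"item": "burger", "delivery": True},
}

def dispatch(call):
    """Route a structured tool call to the matching backend function (stubbed)."""
    if call["tool"] == "place_order":
        return f"Order placed: {call['arguments']['item']}"
    raise ValueError(f"Unknown tool: {call['tool']}")

print(dispatch(tool_call))  # → Order placed: burger
```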
This makes the LLM part of a larger application architecture rather than the whole system.
Tools for Reasoning
The slides also show that LLMs are not always good at exact math.
Question:
How much would I have after 8 years if I deposit $100 at 5% interest?
A model may produce the wrong number if it tries to reason directly in text. The more reliable method is tool use: the LLM should call an external calculator.
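A sketch of that calculator call. Note the exact value is 147.7455…, so depending on rounding it appears as 147.75 or, truncated, as the 147.74 the slides show:

```python
# The computation the LLM should delegate to a calculator tool:
# compound interest on $100 at 5% annual interest for 8 years.

def compound(principal, rate, years):
    return principal * (1 + rate) ** years

amount = compound(100, 0.05, 8)
print(f"{amount:.2f}")  # → 147.75
```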
Math Tool Flow
```mermaid
flowchart TD
    Q[User asks math question] --> LLM[LLM recognizes need for precise calculation]
    LLM --> Calc[External calculator]
    Calc --> Result[147.74]
    Result --> Answer[LLM returns grounded answer]
```
This is an important engineering lesson: do not force the LLM to do tasks that a specialized tool can do more reliably.
Agents
Agents use LLMs to perform multi-step reasoning and actions.
Example task:
Research BetterBurgers competitors
Agent plan:
- Search competitors
- Visit websites
- Summarize each company
Agent Workflow Diagram
```mermaid
flowchart TD
    U[User goal] --> P[LLM plans steps]
    P --> S[Search]
    S --> V[Visit websites]
    V --> R[Read content]
    R --> M[Summarize findings]
    M --> O[Return final answer]
```
Agents are still an active research area, but the core idea is already useful: combine reasoning, planning, and tools to solve multi-step tasks.
The LLM acts as a controller that decides which tools to use.
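A toy version of that controller loop, under strong assumptions: the plan is hard-coded and the search/visit/summarize tools are stubs. A real agent would let the LLM choose tools and decide when to stop:

```python
# Toy agent loop: plan -> search -> visit -> summarize, with stubbed tools.

def search(query):
    return ["competitor-a.example", "competitor-b.example"]  # stub search results

def visit(url):
    return f"Homepage text of {url}"                         # stub page fetch

def summarize(text):
    return text[:30] + "..."                                 # stub LLM summary

def run_agent(goal):
    findings = []
    for url in search(goal):              # step 1: search competitors
        page = visit(url)                 # step 2: visit website
        findings.append(summarize(page))  # step 3: summarize content
    return findings

results = run_agent("Research BetterBurgers competitors")
print(results)
```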
Key Insight
The most important shift is this:
- LLMs are not just knowledge sources.
- They are reasoning engines that process information.
Instead of asking "What does the model know?", we ask "What information can we give the model so it can reason about it?"
