Basic Interaction with LLMs — The Concepts Every AI Engineer Must Learn First
When people start learning AI Engineering, they often jump directly into topics like RAG, Agents, Vector Databases, and MCP. However, before building advanced AI systems, it is important to understand how Large Language Models (LLMs) work at the most basic level.
In this article, we’ll explore the foundational concepts every AI Engineer should understand before moving to more advanced topics.
The Big Picture
At its core, every AI application follows a simple flow:
text User Input ↓ Prompt ↓ LLM ↓ Generated Response Whether you're building ChatGPT, a coding assistant, a document search tool, or an AI tutor, everything starts with this basic interaction.
1. What is an LLM?
An LLM (Large Language Model) is a machine learning model trained on enormous amounts of text data.
Instead of following hardcoded rules like traditional software, an LLM learns patterns from data and uses those patterns to predict the next token in a sequence.
Traditional Programming:
text Input ↓ Rules ↓ Output LLM:
text Input ↓ Learned Patterns ↓ Output Examples of LLMs include:
- Qwen - Llama - Gemma - DeepSeek - GPT
2. Training vs Inference
A common beginner misconception is that the model learns every time we ask a question. This is not true.
Training:
text Learn patterns from data
Inference:
text Use learned patterns to generate responses
When you run:
python response = ollama.chat(...) you are performing inference. The model has already been trained. It is simply generating an answer based on what it learned during training.
3. What is a Prompt?
A prompt is the input provided to the model.
Examples:
What is Kotlin?
Explain functional programming.
Write a Python script to sort a list. The quality of the prompt directly affects the quality of the response.
4. Prompt vs Instruction
Many beginners use these terms interchangeably, but they are slightly different.
Prompt:
Tell me about Kotlin.A prompt defines what information you want from the model.
Instruction:
Explain Kotlin in one sentence. An instruction guides how the model should answer.
As AI Engineers, we often combine prompts and instructions to obtain better results.
5. Understanding Tokens
Large Language Models (LLMs) do not process text as complete words or sentences. Instead, they work with smaller units called tokens.
For example:
What is Kotlin? may be split internally into multiple tokens.
Similarly:
Functional Programmingmay be broken into several tokens depending on the model’s tokenizer.
During generation, the model predicts one token at a time, continuously selecting the most likely next token until it completes the response.
Understanding tokens is important because:
6. Context Window
The context window is the amount of information an LLM can “see” and use while generating a response. You can think of it as the model’s working memory during a conversation.
For example:
User: What is Kotlin?
Assistant: Kotlin is a modern programming language...
User: Explain Coroutines.When answering the second question, the model can see the earlier conversation and use that information as context.
A simple analogy is reading a book. If you only remember the last paragraph, you may miss important details. If you remember several pages, you can better understand the story. The context window works in a similar way.
Larger context windows allow models to handle:
The context window is measured in tokens, and everything inside it—system prompts, user messages, conversation history, and documents—consumes part of the available space.
A model can only reason about information that fits inside its current context window.
7. Why Does the Model Remember Previous Messages?
Many people assume that an LLM stores memory internally between messages. In reality, the model does not automatically remember past conversations. What actually happens is that the application sends the previous conversation along with each new request.
For example:
User: What is Kotlin?
Assistant: Kotlin is a modern programming language...
User: Explain Coroutines.
Assistant: Coroutines are used for asynchronous programming...
User: Next?When the model receives the message “Next?”, it also receives the earlier conversation as part of the context. This allows it to understand that “Next?” likely means continuing the discussion about Coroutines.
A useful analogy is giving someone a printed transcript of a conversation before asking a new question. They appear to remember the discussion, but they are actually reading the previous messages again.
This is why the model seems to remember earlier topics—it is not recalling them from memory, but using the conversation history that is included in the current context.
In AI applications, this conversation history is often called context, and managing it effectively is a key part of building chat systems, assistants, and AI agents.
8. What is Temperature?
Temperature controls how predictable or creative an LLM’s responses are during text generation.
You can think of temperature as a creativity knob:
Low Temperature
0.1At low temperatures, the model strongly prefers the most probable next token. This produces responses that are more consistent, reliable, and deterministic
Example use cases:
High Temperature
1.2At high temperatures, the model is more willing to explore less probable token choices instead of always selecting the most likely one. This leads to more diverse, creative, and sometimes unexpected responses.
Example use cases:
Example
Prompt:
Give me a name for a robot.Low Temperature (0.1):
RoboBotThe model chooses the most likely and safest option.
High Temperature (1.2):
NovaSpark
QuantumWing
EchoByteThe model explores additional token possibilities, producing more varied and imaginative results.
Why Does This Happen?
When generating text, the model assigns probabilities to possible next tokens.
Example:
| Next Token | Probability |
|---|---|
| language | 70% |
| tool | 15% |
| framework | 10% |
With a low temperature, the model heavily favors the highest-probability token (language).
With a high temperature, the probability distribution becomes flatter, giving lower-probability tokens (tool, framework) a better chance of being selected.
In general:
A simple mental model is:
Low Temperature → Most Probable Token
High Temperature → Explore More Possible Tokens9. What is Streaming?
Streaming is the process of sending a model’s response to the user as it is being generated, instead of waiting for the entire response to be completed.
Without streaming:
Request
↓
Model Generates Entire Response
↓
Full Response ReturnedThe user sees nothing until the model finishes generating all tokens.
With streaming:
Request
↓
H
He
Hel
Hell
Hello
↓
Response Continues...The user starts receiving tokens immediately as they are generated.
Why is Streaming Important?
Streaming improves the user experience because the application feels much faster and more responsive. Even if the model takes several seconds to generate a complete response, users can begin reading the answer almost instantly.
Benefits of streaming:
Example
Suppose the model needs 10 seconds to generate a response.
Without Streaming:
Wait 10 seconds
↓
Entire response appearsWith Streaming:
Wait 1 second
↓
First tokens appear
↓
Response continues token by tokenAlthough the total generation time may be similar, streaming makes the application feel significantly faster.
This is how ChatGPT, coding assistants, and most modern AI applications deliver responses in real time.
10. Understanding Latency
Latency is the time it takes for a model to start and complete a response after receiving a request.
In AI applications, lower latency generally leads to a faster and more responsive user experience.
Time To First Token (TTFT)
Time To First Token (TTFT) measures the time between sending a request and receiving the first generated token.
Request
↓
Waiting...
↓
First Token AppearsTokens Per Second (TPS)
Tokens Per Second (TPS) measures how quickly the model generates tokens after generation begins.
Token 1
Token 2
Token 3
Token 4
...Higher TPS means the response is generated and displayed faster.
Example
Suppose a model takes:
TTFT = 1.2 seconds
TPS = 50 tokens/secondThis means:
Why Latency Matters
Latency directly affects the user experience.
Lower latency helps with:
Interactive chat applications
Higher latency can make applications feel slow, even if the final answer is accurate.
11. Structured Output
AI responses do not always have to be plain text. Instead, we can ask the model to return data in a structured format such as JSON.
For example:
{
"language": "Kotlin",
"type": "Programming Language",
"platform": "JVM"
}Unlike natural language responses, structured outputs are designed to be easily processed by software applications.
Why Use Structured Output?
Structured data is predictable and machine-readable, making it easier for applications to parse, validate, store, and display information.
User Request
↓
LLM
↓
JSON Response
↓
Python Application
↓
UI / Database / APICommon Use Cases
Structured outputs are commonly used for:
AI-powered applications
Prompt:
Explain Kotlin.
Return the result as JSON.Response:
{
"name": "Kotlin",
"type": "Programming Language",
"primaryPlatform": "JVM",
"createdBy": "JetBrains"
}Plain Text = For Humans Structured Output (JSON) = For Applications
Many modern AI systems operate using the following pattern:
User
↓
LLM
↓
Structured JSON
↓
Validation
↓
Application Logic
↓
UI / DatabaseThis is why structured output is a foundational concept in AI Engineering.
12. Choosing the Right Model
Different AI models are optimized for different tasks, so there is no single “best” model for every use case.
| Model | Typical Strengths |
|---|---|
| Qwen | General knowledge, coding, and multilingual tasks |
| DeepSeek | Coding, reasoning, and technical problem solving |
| Gemma | Lightweight assistance and efficient local deployment |
| Llama | General-purpose conversational AI |
| Embedding Models | Semantic search and information retrieval |
When selecting a model, consider factors such as:
For example, a smaller model may be ideal for a local assistant, while a larger model may provide better performance for complex reasoning tasks.
A useful mental model is:
Right Model + Right Prompt
>
Biggest Model + Poor PromptSuccessful AI systems are built by matching the model’s strengths to the application’s requirements rather than always maximizing model size.
Conclusion
The concepts covered in this post form the foundation of working with Large Language Models. By understanding prompts, tokens, context windows, temperature, streaming, latency, structured outputs, and model selection, we can move beyond simply using AI and start building reliable AI-powered applications.
