Arun Pandian M

Android Dev | Full-Stack & AI Learner

Jun 8, 2026


Basic Interaction with LLMs — The Concepts Every AI Engineer Must Learn First

When people start learning AI Engineering, they often jump directly into topics like RAG, Agents, Vector Databases, and MCP. However, before building advanced AI systems, it is important to understand how Large Language Models (LLMs) work at the most basic level.

In this article, we’ll explore the foundational concepts every AI Engineer should understand before moving to more advanced topics.

The Big Picture

At its core, every AI application follows a simple flow:

User Input
   ↓
Prompt
   ↓
LLM
   ↓
Generated Response

Whether you're building ChatGPT, a coding assistant, a document search tool, or an AI tutor, everything starts with this basic interaction.


1. What is an LLM?

An LLM (Large Language Model) is a machine learning model trained on enormous amounts of text data.

Instead of following hardcoded rules like traditional software, an LLM learns patterns from data and uses those patterns to predict the next token in a sequence.

Traditional Programming:

Input
   ↓
Rules
   ↓
Output

LLM:

Input
   ↓
Learned Patterns
   ↓
Output

Examples of LLMs include:

- Qwen
- Llama
- Gemma
- DeepSeek
- GPT

2. Training vs Inference

A common beginner misconception is that the model learns every time we ask a question. This is not true.

Training:

Learn patterns from data

Inference:

Use learned patterns to generate responses

When you run:

response = ollama.chat(...)

you are performing inference. The model has already been trained. It is simply generating an answer based on what it learned during training.
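To make the split concrete, here is a toy sketch rather than a real LLM: a tiny next-word model built from word counts. The corpus and the function names `train` and `predict_next` are made up for illustration. The point is only that training builds the model once, and inference reads it without changing it.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def train(corpus: List[str]) -> Dict[str, Counter]:
    """Training: learn which word tends to follow which from example text."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return dict(counts)


def predict_next(model: Dict[str, Counter], word: str) -> Optional[str]:
    """Inference: look up the learned counts; nothing is updated here."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical training data for the toy model.
corpus = [
    "kotlin is a programming language",
    "python is a programming language",
    "kotlin is concise",
]
model = train(corpus)

print(predict_next(model, "is"))           # most frequent follower of "is"
print(predict_next(model, "programming"))  # most frequent follower of "programming"
```

Calling `predict_next` any number of times leaves `model` untouched, which mirrors what happens when we send a question to an already trained LLM.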

3. What is a Prompt?

A prompt is the input provided to the model.

Examples:

 What is Kotlin? 

 Explain functional programming. 

 Write a Python script to sort a list.

The quality of the prompt directly affects the quality of the response.

4. Prompt vs Instruction

Many beginners use these terms interchangeably, but they are slightly different.

Prompt:

Tell me about Kotlin.

A prompt defines what information you want from the model.

Instruction:

 Explain Kotlin in one sentence.

An instruction guides how the model should answer.

As AI Engineers, we often combine prompts and instructions to obtain better results.
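One common way to combine the two is the role-based message list that chat APIs accept. The sketch below assumes the instruction goes in a `system` message and the prompt in a `user` message; that placement is a convention, not a rule, and `build_messages` is a hypothetical helper.

```python
from typing import Dict, List


def build_messages(instruction: str, prompt: str) -> List[Dict[str, str]]:
    """Pair a 'how to answer' instruction with a 'what to answer' prompt."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": prompt},
    ]


messages = build_messages(
    instruction="Answer in one sentence.",
    prompt="Tell me about Kotlin.",
)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

A list like this is what would typically be passed as the `messages` argument of a chat call.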

5. Understanding Tokens

Large Language Models (LLMs) do not process text as complete words or sentences. Instead, they work with smaller units called tokens.

For example:

What is Kotlin?

may be split internally into multiple tokens.

Similarly:

Functional Programming

may be broken into several tokens depending on the model’s tokenizer.

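During generation, the model predicts one token at a time: it assigns a probability to each possible next token, picks one (often, but not always, the most likely), appends it to the text, and repeats until it produces a stop token or reaches a length limit.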

Understanding tokens is important because:

Context windows are measured in tokens.

API costs are often calculated based on input and output tokens.

Model performance is commonly measured in tokens generated per second.

Conversation history, documents, and prompts all consume tokens from the available context window.
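To get a feel for how text turns into tokens, here is a deliberately simplified sketch. It uses a tiny, made-up vocabulary and greedy longest-match splitting. Real tokenizers learn vocabularies with tens of thousands of pieces and often use other algorithms such as byte-pair encoding, so actual splits will differ from model to model.

```python
from typing import List, Set

# Hypothetical vocabulary, chosen only so the article's examples split nicely.
VOCAB: Set[str] = {
    "What", " is", " Kot", "lin", "?",
    "Function", "al", " Program", "ming",
}


def tokenize(text: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Split text by repeatedly taking the longest vocabulary piece that matches."""
    tokens = []
    position = 0
    while position < len(text):
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in vocab:
                break
        else:
            piece = text[position]  # unknown character: fall back to one char
        tokens.append(piece)
        position += len(piece)
    return tokens


for sample in ("What is Kotlin?", "Functional Programming"):
    pieces = tokenize(sample)
    print(len(pieces), pieces)
```

Even in this toy version, a two-word phrase becomes four tokens, which is why token counts, not word counts, are what matter for context limits and pricing.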

6. Context Window

The context window is the amount of information an LLM can “see” and use while generating a response. You can think of it as the model’s working memory during a conversation.

For example:

User: What is Kotlin?

Assistant: Kotlin is a modern programming language...

User: Explain Coroutines.

When answering the second question, the model can see the earlier conversation and use that information as context.

A simple analogy is reading a book. If you only remember the last paragraph, you may miss important details. If you remember several pages, you can better understand the story. The context window works in a similar way.

Larger context windows allow models to handle:

Longer conversations

Large documents and PDFs

Multiple files at the same time

Detailed instructions and requirements

More relevant information without losing context

The context window is measured in tokens, and everything inside it—system prompts, user messages, conversation history, and documents—consumes part of the available space.

Beyond what it learned during training, a model can only reason about information that fits inside its current context window.
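Because the window is finite, applications usually have to decide what to keep when a conversation grows too long. The sketch below shows one simple policy: always keep the system message, then keep the newest messages that still fit. The word-count "tokenizer" and the `fit_to_window` helper are assumptions for illustration; a real application would count tokens with the model's own tokenizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(
    messages: List[Message],
    max_tokens: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[Message]:
    """Keep the system message plus the newest messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)

    kept: List[Message] = []
    for message in reversed(others):  # walk from newest to oldest
        cost = count(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a concise tutor."},
    {"role": "user", "content": "What is Kotlin?"},
    {"role": "assistant", "content": "Kotlin is a modern programming language."},
    {"role": "user", "content": "Explain Coroutines."},
]
for message in fit_to_window(history, max_tokens=14):
    print(f"{message['role']}: {message['content']}")
```

With a budget of 14 "tokens", the oldest user message is dropped while the system message and the two most recent turns survive. Other policies, such as summarizing old turns, are equally valid.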

7. Why Does the Model Remember Previous Messages?

Many people assume that an LLM stores memory internally between messages. In reality, the model does not automatically remember past conversations. What actually happens is that the application sends the previous conversation along with each new request.

For example:

User: What is Kotlin?

Assistant: Kotlin is a modern programming language...

User: Explain Coroutines.

Assistant: Coroutines are used for asynchronous programming...

User: Next?

When the model receives the message “Next?”, it also receives the earlier conversation as part of the context. This allows it to understand that “Next?” likely means continuing the discussion about Coroutines.

A useful analogy is giving someone a printed transcript of a conversation before asking a new question. They appear to remember the discussion, but they are actually reading the previous messages again.

This is why the model seems to remember earlier topics—it is not recalling them from memory, but using the conversation history that is included in the current context.

In AI applications, this conversation history is often called context, and managing it effectively is a key part of building chat systems, assistants, and AI agents.
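A minimal sketch of that mechanism is shown below. The `ask` helper and `fake_model` are hypothetical stand-ins so the example runs without a model; `fake_model` simply reports how many messages it was given on each call.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def ask(
    history: List[Message],
    user_text: str,
    chat_fn: Callable[[List[Message]], str],
) -> str:
    """Append the user turn, send the full history, then record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = chat_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_model(messages: List[Message]) -> str:
    """Stand-in for a real LLM call; reports how many messages it received."""
    return f"(reply based on {len(messages)} messages)"


history: List[Message] = []
print(ask(history, "What is Kotlin?", fake_model))      # model sees 1 message
print(ask(history, "Explain Coroutines.", fake_model))  # model sees 3 messages
print(ask(history, "Next?", fake_model))                # model sees 5 messages
```

The "memory" lives entirely in the `history` list owned by the application. With a real client, `chat_fn` would wrap the actual chat call and return the text of the reply.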

8. What is Temperature?

Temperature controls how predictable or creative an LLM’s responses are during text generation.

You can think of temperature as a creativity knob:

Lower values make the model more focused and predictable.

Higher values make the model more creative and varied.

Low Temperature

0.1

At low temperatures, the model strongly prefers the most probable next token. This produces responses that are more consistent and repeatable, and close to deterministic as the temperature approaches 0.

Example use cases:

Structured outputs (JSON)

Technical explanations

Code generation

Question answering

Data extraction

High Temperature

1.2

At high temperatures, the model is more willing to explore less probable token choices instead of always selecting the most likely one. This leads to more diverse, creative, and sometimes unexpected responses.

Example use cases:

Story writing

Brainstorming ideas

Marketing content

Creative writing

Generating multiple alternatives

Example

Prompt:

Give me a name for a robot.

Low Temperature (0.1):

RoboBot

The model chooses the most likely and safest option.

High Temperature (1.2):

NovaSpark
QuantumWing
EchoByte

The model explores additional token possibilities, producing more varied and imaginative results.

Why Does This Happen?

When generating text, the model assigns probabilities to possible next tokens.

Example:

Next Token	Probability
language	70%
tool	15%
framework	10%

With a low temperature, the model heavily favors the highest-probability token (language).

With a high temperature, the probability distribution becomes flatter, giving lower-probability tokens (tool, framework) a better chance of being selected.

Temperature does not change the model’s knowledge. Instead, it influences how the model chooses the next token during generation.
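The effect can be reproduced with a few lines of arithmetic. The sketch below rescales each probability as p ** (1 / T) and renormalizes, which is equivalent to dividing the model's logits by T before the softmax. The `other` entry is an assumption added to account for the remaining 5% in the table, and `apply_temperature` is a hypothetical helper name.

```python
import random
from typing import Dict


def apply_temperature(probs: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Rescale probabilities as p ** (1 / T), then renormalize to sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: p ** (1.0 / temperature) for token, p in probs.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# Probabilities from the table above; "other" holds the remaining 5%.
next_token = {"language": 0.70, "tool": 0.15, "framework": 0.10, "other": 0.05}

for t in (0.1, 1.0, 1.2):
    adjusted = apply_temperature(next_token, t)
    print(t, {token: round(p, 3) for token, p in adjusted.items()})

# Sampling then draws one token according to the adjusted probabilities.
adjusted = apply_temperature(next_token, 1.2)
choice = random.choices(list(adjusted), weights=list(adjusted.values()), k=1)[0]
print("sampled:", choice)
```

At T = 0.1 almost all of the probability lands on `language`, at T = 1.0 the distribution is unchanged, and at T = 1.2 `language` drops below its original 70% while the other tokens gain ground.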

In general:

Use low temperature when accuracy, consistency, and reliability are important.

Use high temperature when creativity, exploration, and variety are more important.

A simple mental model is:

Low Temperature  → Most Probable Token
High Temperature → Explore More Possible Tokens

9. What is Streaming?

Streaming is the process of sending a model’s response to the user as it is being generated, instead of waiting for the entire response to be completed.

Without streaming:

Request
   ↓
Model Generates Entire Response
   ↓
Full Response Returned

The user sees nothing until the model finishes generating all tokens.

With streaming:

Request
   ↓
H
He
Hel
Hell
Hello
   ↓
Response Continues...

The user starts receiving tokens immediately as they are generated.

Why is Streaming Important?

Streaming improves the user experience because the application feels much faster and more responsive. Even if the model takes several seconds to generate a complete response, users can begin reading the answer almost instantly.

Benefits of streaming:

Faster perceived response time

Better user experience

Real-time feedback during generation

More natural conversational interactions

Example

Suppose the model needs 10 seconds to generate a response.

Without Streaming:

Wait 10 seconds
↓
Entire response appears

With Streaming:

Wait 1 second
↓
First tokens appear
↓
Response continues token by token

Although the total generation time may be similar, streaming makes the application feel significantly faster.

Streaming does not make the model generate tokens faster. Instead, it allows applications to display tokens as soon as they are produced, reducing the time users spend waiting for visible output.
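In code, streaming usually looks like iterating over a generator instead of waiting for a single return value. The sketch below simulates a model with a generator that sleeps briefly between tokens; the token list, the delay, and the function names are made up for illustration.

```python
import time
from typing import Iterator, List


def generate_tokens(tokens: List[str], delay: float = 0.05) -> Iterator[str]:
    """Stand-in for a model: yields one token at a time with a small delay."""
    for token in tokens:
        time.sleep(delay)
        yield token


def respond_without_streaming(tokens: List[str]) -> str:
    """Collect every token first, then return the full text at once."""
    return "".join(generate_tokens(tokens))


def respond_with_streaming(tokens: List[str]) -> str:
    """Show each token as soon as it is produced."""
    shown = []
    for token in generate_tokens(tokens):
        print(token, end="", flush=True)
        shown.append(token)
    print()
    return "".join(shown)


reply = ["Hello", ",", " world", "!"]
print(respond_without_streaming(reply))  # appears only after all tokens exist
respond_with_streaming(reply)            # appears piece by piece
```

Both functions take the same total time and produce the same text. The only difference is when the user first sees something on screen.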

This is how ChatGPT, coding assistants, and most modern AI applications deliver responses in real time.

10. Understanding Latency

Latency is the time it takes for a model to start and complete a response after receiving a request.

In AI applications, lower latency generally leads to a faster and more responsive user experience.

Time To First Token (TTFT)

Time To First Token (TTFT) measures the time between sending a request and receiving the first generated token.

Request
   ↓
Waiting...
   ↓
First Token Appears

Tokens Per Second (TPS)

Tokens Per Second (TPS) measures how quickly the model generates tokens after generation begins.

Token 1
Token 2
Token 3
Token 4
...

Higher TPS means the response is generated and displayed faster.

Example

Suppose a model takes:

TTFT = 1.2 seconds
TPS  = 50 tokens/second

This means:

The first token appears after 1.2 seconds.

After generation starts, the model produces approximately 50 tokens every second.
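These two numbers are enough for a rough estimate of total response time: the wait for the first token plus the output length divided by TPS. The sketch below works through that arithmetic for a hypothetical 500-token answer, and also shows one way the two metrics could be measured from any token stream. The simulated stream and the helper names are assumptions for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def estimate_total_seconds(ttft: float, tps: float, output_tokens: int) -> float:
    """Rough total time: wait for the first token plus generation time."""
    return ttft + output_tokens / tps


def measure(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, tokens_per_second) observed while reading a stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        count += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    generation_time = end - first_token_at
    # TPS counts the tokens that arrived after the first one.
    tps = (count - 1) / generation_time if count > 1 and generation_time > 0 else 0.0
    return first_token_at - start, tps


def simulated_stream(n: int, first_delay: float, per_token: float) -> Iterator[str]:
    """Hypothetical model: a startup delay, then evenly spaced tokens."""
    time.sleep(first_delay)
    for i in range(n):
        if i:
            time.sleep(per_token)
        yield f"tok{i}"


# 1.2 s to first token, then 500 tokens at 50 tokens/second: 1.2 + 10 seconds.
print(f"{estimate_total_seconds(1.2, 50, 500):.1f} s")

ttft, tps = measure(simulated_stream(20, first_delay=0.05, per_token=0.01))
print(f"TTFT ~ {ttft:.2f} s, TPS ~ {tps:.0f}")
```

The measured values will vary from run to run, which is also true of real models, so it is worth averaging over several requests.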

Why Latency Matters

Latency directly affects the user experience.

Lower latency helps with:

Interactive chat applications

AI assistants

Code generation tools

Real-time AI systems

Higher latency can make applications feel slow, even if the final answer is accurate.

When evaluating model performance, it is important to consider both:

- Time To First Token (TTFT): how quickly the response begins.
- Tokens Per Second (TPS): how quickly the response continues.

11. Structured Output

AI responses do not always have to be plain text. Instead, we can ask the model to return data in a structured format such as JSON.

For example:

{
  "language": "Kotlin",
  "type": "Programming Language",
  "platform": "JVM"
}

Unlike natural language responses, structured outputs are designed to be easily processed by software applications.

Why Use Structured Output?

Structured data is predictable and machine-readable, making it easier for applications to parse, validate, store, and display information.

User Request
      ↓
LLM
      ↓
JSON Response
      ↓
Python Application
      ↓
UI / Database / API

Common Use Cases

Structured outputs are commonly used for:

Data extraction

Ticket breakdown generation

Quiz and assessment generation

API integrations

Workflow automation

AI-powered applications

Prompt:

Explain Kotlin.

Return the result as JSON.

Response:

{
  "name": "Kotlin",
  "type": "Programming Language",
  "primaryPlatform": "JVM",
  "createdBy": "JetBrains"
}

Plain Text = For Humans
Structured Output (JSON) = For Applications

Structured output allows AI systems to return data in a format that applications can directly consume and process. Instead of reading paragraphs of text, software can work with clearly defined fields and values. This is one of the most important techniques used in production AI systems because it enables reliable communication between AI models and applications.

Many modern AI systems operate using the following pattern:

User
 ↓
LLM
 ↓
Structured JSON
 ↓
Validation
 ↓
Application Logic
 ↓
UI / Database

This is why structured output is a foundational concept in AI Engineering.
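The validation step in that pipeline matters because a model can return malformed JSON or leave out a field. Here is a minimal sketch using only the standard library. The required fields mirror the Kotlin example above, and `parse_language_info` is a hypothetical helper; production systems often use a schema library instead.

```python
import json
from typing import Any, Dict

# Fields and types the application expects, taken from the example response.
REQUIRED_FIELDS = {
    "name": str,
    "type": str,
    "primaryPlatform": str,
    "createdBy": str,
}


def parse_language_info(raw: str) -> Dict[str, Any]:
    """Parse a model reply as JSON and check that the expected fields exist."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as error:
        raise ValueError(f"reply is not valid JSON: {error}") from error
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"missing or invalid field: {field}")
    return data


good_reply = (
    '{"name": "Kotlin", "type": "Programming Language", '
    '"primaryPlatform": "JVM", "createdBy": "JetBrains"}'
)
info = parse_language_info(good_reply)
print(info["name"], "was created by", info["createdBy"])

try:
    parse_language_info("Kotlin is a programming language.")
except ValueError as error:
    print("rejected:", error)
```

When validation fails, a common approach is to retry the request or fall back to a default, rather than passing unchecked data to the rest of the application.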

12. Choosing the Right Model

Different AI models are optimized for different tasks, so there is no single “best” model for every use case.

Model	Typical Strengths
Qwen	General knowledge, coding, and multilingual tasks
DeepSeek	Coding, reasoning, and technical problem solving
Gemma	Lightweight assistance and efficient local deployment
Llama	General-purpose conversational AI
Embedding Models	Semantic search and information retrieval

When selecting a model, consider factors such as:

Task requirements

Response quality

Latency and speed

Hardware limitations

Context window size

Cost and resource usage

For example, a smaller model may be ideal for a local assistant, while a larger model may provide better performance for complex reasoning tasks.

Model selection is about finding the best fit for the problem you are trying to solve. In many cases, choosing a model that is well-suited to the task will produce better results than simply choosing the largest available model.

A useful mental model is:

Right Model + Right Prompt
>
Biggest Model + Poor Prompt

Successful AI systems are built by matching the model’s strengths to the application’s requirements rather than always maximizing model size.

Conclusion

The concepts covered in this post form the foundation of working with Large Language Models. By understanding prompts, tokens, context windows, temperature, streaming, latency, structured outputs, and model selection, we can move beyond simply using AI and start building reliable AI-powered applications.

