Arun Pandian M

Arun Pandian M

Android Dev | Full-Stack & AI Learner

Written by: Arun Pandian MPublished on: Jun 8, 2026

Basic Interaction with LLMs — The Concepts Every AI Engineer Must Learn First

When people start learning AI Engineering, they often jump directly into topics like RAG, Agents, Vector Databases, and MCP. However, before building advanced AI systems, it is important to understand how Large Language Models (LLMs) work at the most basic level.

In this article, we’ll explore the foundational concepts every AI Engineer should understand before moving to more advanced topics.

The Big Picture

At its core, every AI application follows a simple flow:

text User Input      ↓ Prompt      ↓ LLM      ↓ Generated Response 

Whether you're building ChatGPT, a coding assistant, a document search tool, or an AI tutor, everything starts with this basic interaction.

https://storage.googleapis.com/lambdabricks-cd393.firebasestorage.app/img_llm_basic_concepts.svg?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-fbsvc%40lambdabricks-cd393.iam.gserviceaccount.com%2F20260611%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20260611T061208Z&X-Goog-Expires=3600&X-Goog-SignedHeaders=host&X-Goog-Signature=582f05939f3ed0ec0d523b09600696a4d534881042ac7929672be57887e8cf3d66dee3fe5f20b468285825f7c9b7c734ce770dba79dcc4bdea6285608c90a5040fb0c97423cbbe955d7c7f6723cda9f0521676f8d3af7bfc4066bd48697cc698dee2262a8847431df4b98ec6abb8f117491945a2e30489d09fb30e218b3e903f76537375896d1fe60762f7bc83dc95f033e6c066dc213619be180607ee0185f1ed0662f58693a6cc9820f3cc6e832388c8436ab566a36e542d4cb2823a17125be5eb48b74f27a3b83651aa11d41fa47b9a083115ad9055458e19a7dda1b260bdcadc411fadbf3838a54fedbce5b4f8fb491bfb4d327154af3cfb1cc7ed3991cd

1. What is an LLM?

An LLM (Large Language Model) is a machine learning model trained on enormous amounts of text data.

Instead of following hardcoded rules like traditional software, an LLM learns patterns from data and uses those patterns to predict the next token in a sequence.

Traditional Programming:

text Input   ↓ Rules   ↓ Output 

LLM:

text Input   ↓ Learned Patterns   ↓ Output 

Examples of LLMs include:

- Qwen - Llama - Gemma - DeepSeek - GPT

2. Training vs Inference

A common beginner misconception is that the model learns every time we ask a question. This is not true.

Training:

text Learn patterns from data

Inference:

text Use learned patterns to generate responses

When you run:

python response = ollama.chat(...) 

you are performing inference. The model has already been trained. It is simply generating an answer based on what it learned during training.

3. What is a Prompt?

A prompt is the input provided to the model.

Examples:

 What is Kotlin? 

 Explain functional programming. 

 Write a Python script to sort a list. 

The quality of the prompt directly affects the quality of the response.

4. Prompt vs Instruction

Many beginners use these terms interchangeably, but they are slightly different.

Prompt:

Tell me about Kotlin.

A prompt defines what information you want from the model.

Instruction:

 Explain Kotlin in one sentence. 

An instruction guides how the model should answer.

As AI Engineers, we often combine prompts and instructions to obtain better results.

5. Understanding Tokens

Large Language Models (LLMs) do not process text as complete words or sentences. Instead, they work with smaller units called tokens.

For example:

What is Kotlin? 

may be split internally into multiple tokens.

Similarly:

Functional Programming

may be broken into several tokens depending on the model’s tokenizer.

During generation, the model predicts one token at a time, continuously selecting the most likely next token until it completes the response.

Understanding tokens is important because:

  • Context windows are measured in tokens.
  • API costs are often calculated based on input and output tokens.
  • Model performance is commonly measured in tokens generated per second.
  • Conversation history, documents, and prompts all consume tokens from the available context window.
  • 6. Context Window

    The context window is the amount of information an LLM can “see” and use while generating a response. You can think of it as the model’s working memory during a conversation.

    For example:

    User: What is Kotlin?
    
    Assistant: Kotlin is a modern programming language...
    
    User: Explain Coroutines.

    When answering the second question, the model can see the earlier conversation and use that information as context.

    A simple analogy is reading a book. If you only remember the last paragraph, you may miss important details. If you remember several pages, you can better understand the story. The context window works in a similar way.

    Larger context windows allow models to handle:

  • Longer conversations
  • Large documents and PDFs
  • Multiple files at the same time
  • Detailed instructions and requirements
  • More relevant information without losing context
  • The context window is measured in tokens, and everything inside it—system prompts, user messages, conversation history, and documents—consumes part of the available space.

    A model can only reason about information that fits inside its current context window.

    7. Why Does the Model Remember Previous Messages?

    Many people assume that an LLM stores memory internally between messages. In reality, the model does not automatically remember past conversations. What actually happens is that the application sends the previous conversation along with each new request.

    For example:

    User: What is Kotlin?
    
    Assistant: Kotlin is a modern programming language...
    
    User: Explain Coroutines.
    
    Assistant: Coroutines are used for asynchronous programming...
    
    User: Next?

    When the model receives the message “Next?”, it also receives the earlier conversation as part of the context. This allows it to understand that “Next?” likely means continuing the discussion about Coroutines.

    A useful analogy is giving someone a printed transcript of a conversation before asking a new question. They appear to remember the discussion, but they are actually reading the previous messages again.

    This is why the model seems to remember earlier topics—it is not recalling them from memory, but using the conversation history that is included in the current context.

    In AI applications, this conversation history is often called context, and managing it effectively is a key part of building chat systems, assistants, and AI agents.

    8. What is Temperature?

    Temperature controls how predictable or creative an LLM’s responses are during text generation.

    You can think of temperature as a creativity knob:

  • Lower values make the model more focused and predictable.
  • Higher values make the model more creative and varied.
  • Low Temperature

    0.1

    At low temperatures, the model strongly prefers the most probable next token. This produces responses that are more consistent, reliable, and deterministic

    Example use cases:

  • Structured outputs (JSON)
  • Technical explanations
  • Code generation
  • Question answering
  • Data extraction
  • High Temperature

    1.2

    At high temperatures, the model is more willing to explore less probable token choices instead of always selecting the most likely one. This leads to more diverse, creative, and sometimes unexpected responses.

    Example use cases:

  • Story writing
  • Brainstorming ideas
  • Marketing content
  • Creative writing
  • Generating multiple alternatives
  • Example

    Prompt:

    Give me a name for a robot.

    Low Temperature (0.1):

    RoboBot

    The model chooses the most likely and safest option.

    High Temperature (1.2):

    NovaSpark
    QuantumWing
    EchoByte

    The model explores additional token possibilities, producing more varied and imaginative results.

    Why Does This Happen?

    When generating text, the model assigns probabilities to possible next tokens.

    Example:

    Next TokenProbability
    language70%
    tool15%
    framework10%

    With a low temperature, the model heavily favors the highest-probability token (language).

    With a high temperature, the probability distribution becomes flatter, giving lower-probability tokens (tool, framework) a better chance of being selected.

    Temperature does not change the model’s knowledge. Instead, it influences how the model chooses the next token during generation.

    In general:

  • Use low temperature when accuracy, consistency, and reliability are important.
  • Use high temperature when creativity, exploration, and variety are more important.
  • A simple mental model is:

    Low Temperature  → Most Probable Token
    High Temperature → Explore More Possible Tokens

    9. What is Streaming?

    Streaming is the process of sending a model’s response to the user as it is being generated, instead of waiting for the entire response to be completed.

    Without streaming:

    Request
       ↓
    Model Generates Entire Response
       ↓
    Full Response Returned

    The user sees nothing until the model finishes generating all tokens.

    With streaming:

    Request
       ↓
    H
    He
    Hel
    Hell
    Hello
       ↓
    Response Continues...

    The user starts receiving tokens immediately as they are generated.

    Why is Streaming Important?

    Streaming improves the user experience because the application feels much faster and more responsive. Even if the model takes several seconds to generate a complete response, users can begin reading the answer almost instantly.

    Benefits of streaming:

  • Faster perceived response time
  • Better user experience
  • Real-time feedback during generation
  • More natural conversational interactions
  • Example

    Suppose the model needs 10 seconds to generate a response.

    Without Streaming:

    Wait 10 seconds
    ↓
    Entire response appears

    With Streaming:

    Wait 1 second
    ↓
    First tokens appear
    ↓
    Response continues token by token

    Although the total generation time may be similar, streaming makes the application feel significantly faster.

    Streaming does not make the model generate tokens faster. Instead, it allows applications to display tokens as soon as they are produced, reducing the time users spend waiting for visible output.
    This is how ChatGPT, coding assistants, and most modern AI applications deliver responses in real time.

    10. Understanding Latency

    Latency is the time it takes for a model to start and complete a response after receiving a request.

    In AI applications, lower latency generally leads to a faster and more responsive user experience.

    Time To First Token (TTFT)

    Time To First Token (TTFT) measures the time between sending a request and receiving the first generated token.

    Request
       ↓
    Waiting...
       ↓
    First Token Appears

    Tokens Per Second (TPS)

    Tokens Per Second (TPS) measures how quickly the model generates tokens after generation begins.

    Token 1
    Token 2
    Token 3
    Token 4
    ...

    Higher TPS means the response is generated and displayed faster.

    Example

    Suppose a model takes:

    TTFT = 1.2 seconds
    TPS  = 50 tokens/second

    This means:

  • The first token appears after 1.2 seconds.
  • After generation starts, the model produces approximately 50 tokens every second.
  • Why Latency Matters

    Latency directly affects the user experience.

    Lower latency helps with:

    Interactive chat applications

  • AI assistants
  • Code generation tools
  • Real-time AI systems
  • Higher latency can make applications feel slow, even if the final answer is accurate.

    When evaluating model performance, it is important to consider both: - Time To First Token (TTFT) — How quickly the response begins. - Tokens Per Second (TPS) — How quickly the response continues.

    11. Structured Output

    AI responses do not always have to be plain text. Instead, we can ask the model to return data in a structured format such as JSON.

    For example:

    {
      "language": "Kotlin",
      "type": "Programming Language",
      "platform": "JVM"
    }

    Unlike natural language responses, structured outputs are designed to be easily processed by software applications.

    Why Use Structured Output?

    Structured data is predictable and machine-readable, making it easier for applications to parse, validate, store, and display information.

    User Request
          ↓
    LLM
          ↓
    JSON Response
          ↓
    Python Application
          ↓
    UI / Database / API

    Common Use Cases

    Structured outputs are commonly used for:

  • Data extraction
  • Ticket breakdown generation
  • Quiz and assessment generation
  • API integrations
  • Workflow automation
  • AI-powered applications

    Prompt:

    Explain Kotlin.
    
    Return the result as JSON.

    Response:

    {
      "name": "Kotlin",
      "type": "Programming Language",
      "primaryPlatform": "JVM",
      "createdBy": "JetBrains"
    }

    Plain Text = For Humans Structured Output (JSON) = For Applications

    Structured output allows AI systems to return data in a format that applications can directly consume and process. Instead of reading paragraphs of text, software can work with clearly defined fields and values. This is one of the most important techniques used in production AI systems because it enables reliable communication between AI models and applications.

    Many modern AI systems operate using the following pattern:

    User
     ↓
    LLM
     ↓
    Structured JSON
     ↓
    Validation
     ↓
    Application Logic
     ↓
    UI / Database

    This is why structured output is a foundational concept in AI Engineering.

    12. Choosing the Right Model

    Different AI models are optimized for different tasks, so there is no single “best” model for every use case.

    ModelTypical Strengths
    QwenGeneral knowledge, coding, and multilingual tasks
    DeepSeekCoding, reasoning, and technical problem solving
    GemmaLightweight assistance and efficient local deployment
    LlamaGeneral-purpose conversational AI
    Embedding ModelsSemantic search and information retrieval

    When selecting a model, consider factors such as:

  • Task requirements
  • Response quality
  • Latency and speed
  • Hardware limitations
  • Context window size
  • Cost and resource usage
  • For example, a smaller model may be ideal for a local assistant, while a larger model may provide better performance for complex reasoning tasks.

    Model selection is about finding the best fit for the problem you are trying to solve. In many cases, choosing a model that is well-suited to the task will produce better results than simply choosing the largest available model.

    A useful mental model is:

    Right Model + Right Prompt
    >
    Biggest Model + Poor Prompt

    Successful AI systems are built by matching the model’s strengths to the application’s requirements rather than always maximizing model size.

    Conclusion

    The concepts covered in this post form the foundation of working with Large Language Models. By understanding prompts, tokens, context windows, temperature, streaming, latency, structured outputs, and model selection, we can move beyond simply using AI and start building reliable AI-powered applications.

    #MachineLearning#Kotlin#SoftwareEngineering#LocalLLM#LearnAI#ContextWindow#Tokens#StructuredOutput#ArtificialIntelligence#AIEngineering#LLM#GenerativeAI#Ollama#OpenSourceAI#Python#PromptEngineering
    LAMBDA BRICKS