How LLMs Generate Text: Understanding the Inference Process

Ben Um • April 7, 2026

Large language models generate text through a mechanical process called inference. This process is not magical — it follows a clear, repeatable sequence of steps that turns your prompt into coherent output token by token.

Before we dive deeper, a few helpful definitions:

A token is the basic unit that large language models work with. It’s usually a word, part of a word, or even a single punctuation mark. For example, the sentence “Hello, how are you?” might be split into 5 or 6 tokens depending on the tokenizer.
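To make the idea concrete, here is a minimal sketch of a toy tokenizer. It is an illustration only: the function name `toy_tokenize` is made up for this example, and real tokenizers typically use learned subword vocabularies rather than a simple regular expression, so their token counts will differ.

```python
import re

def toy_tokenize(text):
    # Hypothetical toy rule: a token is either a run of word characters
    # or a single punctuation mark. Whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Hello, how are you?")
print(tokens)       # ['Hello', ',', 'how', 'are', 'you', '?']
print(len(tokens))  # 6
```

With this toy rule the sentence comes out as 6 tokens. A subword tokenizer might merge or split pieces differently.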

Terminology Note:
In the broader literature and most inference systems, a collection of Key and Value vectors produced during inference is commonly called the “KV cache”. In this article, I refer to it as the “KV substrate”. I chose this term to emphasize its true importance: it is not just a temporary storage trick for speed. Instead, it is the foundational, passive medium that holds the model’s compiled understanding of the entire prompt and conversation so far — including instructions, context, facts, tone, and constraints.

Inference has two main phases: Prefill and Decode. Prefill processes the whole prompt at once; decode then generates the output one token at a time.

1. Prefill Phase – Building the Context

When you send a prompt, the model first processes the entire prompt in parallel. This is called the prefill phase.

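During prefill, the model:

  1. Embeds every token of the prompt.
  2. Runs all prompt tokens through every transformer layer together, with each token attending to the tokens before it.
  3. Computes and stores a Key and a Value vector for every token at every layer.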

The result is the KV substrate: a large, passive collection of stored tensors holding every Key and Value vector computed so far, for each token at each layer. Think of it metaphorically as a rich contextual matrix whose existing entries do not change once written. Together, these vectors capture the model’s interpretation of the prompt: instructions, facts, tone, constraints, relationships between ideas, and more.

At the end of prefill, the model also produces a final hidden state for the last token in the prompt. This vector serves as the model’s internal summary of everything it has seen so far, and the first output token is predicted from it.
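As a rough illustration of prefill, the sketch below builds Key and Value vectors for a whole prompt. Everything in it is an assumption made for the example: the sizes, the random weights, the names `prefill`, `W_k`, and `W_v`, and the restriction to a single layer with a single head and no positional information. A real model does this as batched matrix multiplications at every layer, where deeper layers take the previous layer’s output as input.

```python
import random

random.seed(0)
D_MODEL = 4   # toy hidden size (assumption)
VOCAB = 10    # toy vocabulary size (assumption)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def vec_mat(vec, mat):
    # Row vector (length n) times matrix (n x m) gives a row vector (length m).
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

embedding = rand_matrix(VOCAB, D_MODEL)   # one row per token id
W_k = rand_matrix(D_MODEL, D_MODEL)       # Key projection (random stand-in)
W_v = rand_matrix(D_MODEL, D_MODEL)       # Value projection (random stand-in)

def prefill(prompt_ids):
    # One Key and one Value vector per prompt token. Each token is handled
    # independently here, which is why the real thing can run in parallel.
    keys = [vec_mat(embedding[t], W_k) for t in prompt_ids]
    values = [vec_mat(embedding[t], W_v) for t in prompt_ids]
    return {"keys": keys, "values": values}

kv = prefill([3, 1, 4, 1])
print(len(kv["keys"]), len(kv["keys"][0]))   # 4 4
```

The returned dictionary plays the role of the KV substrate for this toy layer: four Key vectors and four Value vectors, one pair per prompt token.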

2. Decode Phase – Generating One Token at a Time

Once prefill is complete, the model switches to autoregressive generation — producing tokens one by one.

Inside each decode step:

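  1. Start with the most recently sampled token (on the first step, this is the token predicted from prefill’s final hidden state; afterward, it is the token chosen in the previous decode step). In a standard transformer, only this token and the KV substrate carry over between steps.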
  2. Process the newest token
    The model embeds the most recently sampled token and computes a new Query (Q), Key (K), and Value (V) for it at each attention layer.

    These new K and V vectors are appended to the KV substrate, becoming part of the holistic contextual representation that will influence all future tokens.
  3. Read from the KV Substrate
    The new Query performs attention over the entire KV substrate using two matrix operations:

    scores = Q_new × K_substrate^T / √d_head
    attention_output = softmax(scores) × V_substrate

    This step is where the model consults the accumulated, compiled understanding stored in the KV substrate.
  4. Update the hidden state
    The attention output flows through the feed-forward network (MLP) and other layers, producing a new final hidden state.
  5. Project to Vocabulary (The Critical Step)
    The final hidden state is passed through the language model head:

    logits = final_hidden_state × W_unembed^T

    This produces a vector of logits with length equal to the vocabulary size (usually 32,000 to 128,000+).
  6. Choose the Next Token
    The logits are converted into probabilities (using softmax, often with temperature, top-p, or top-k sampling). One token is selected and fed back into the model. The loop repeats.
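Step 6 can be sketched as follows. This is a minimal illustration of temperature scaling combined with top-k sampling, not the sampler of any particular library. The function names, the example logits, and the parameter values are assumptions, and top-p is left out to keep the example short.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, k=3, temperature=1.0, rng=random):
    # Keep the k highest-logit token ids, renormalize over them, draw one.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids], temperature)
    return rng.choices(top_ids, weights=probs, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0, 0.0]       # made-up logits for a 5-token vocabulary
rng = random.Random(0)
next_id = sample_top_k(logits, k=2, temperature=0.8, rng=rng)
print(next_id)                            # either 3 or 0, the two highest logits
```

Setting `k=1` reduces this to greedy decoding, which always picks the highest-logit token.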

Summary: The Full Picture

Prefill: Heavy parallel processing that builds the passive KV substrate (the rich contextual matrix made of Key and Value vectors) plus the initial hidden state.

Decode: A repeated loop that reads from the KV substrate via attention, updates the hidden state, projects to logits, and samples the next token.
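The decode loop summarized above can be sketched end to end in a few lines. This is a toy under stated assumptions, not a real model: one layer, one attention head, random weights, the embedding table reused as the unembedding matrix, a bare residual connection in place of the MLP and normalization, and greedy selection instead of sampling. The name `decode_step` is hypothetical. For brevity, the cache for the prompt is filled by running the same step token by token rather than in parallel.

```python
import math
import random

random.seed(0)
D, VOCAB = 4, 8   # toy sizes (assumptions); with a single head, d_head equals D

def rand_matrix(r, c):
    return [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]

def vec_mat(vec, mat):
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

E = rand_matrix(VOCAB, D)   # embedding table, reused as W_unembed (assumption)
W_q, W_k, W_v = (rand_matrix(D, D) for _ in range(3))

def decode_step(token_id, keys, values):
    x = E[token_id]
    q, k, v = vec_mat(x, W_q), vec_mat(x, W_k), vec_mat(x, W_v)
    keys.append(k)            # append the new K and V to the cache
    values.append(v)
    scores = [dot(q, kj) / math.sqrt(D) for kj in keys]
    weights = softmax(scores)
    attn = [sum(w * vj[i] for w, vj in zip(weights, values)) for i in range(D)]
    hidden = [a + b for a, b in zip(x, attn)]            # residual only, no MLP
    logits = [dot(hidden, E[t]) for t in range(VOCAB)]   # hidden x W_unembed^T
    return max(range(VOCAB), key=lambda t: logits[t])    # greedy choice

keys, values = [], []
prompt = [1, 5, 2]
for t in prompt[:-1]:         # stand-in for prefill: fill the cache
    decode_step(t, keys, values)

token, generated = prompt[-1], []
for _ in range(4):            # each step consumes the previous step's token
    token = decode_step(token, keys, values)
    generated.append(token)
print(generated, len(keys))
```

Note what persists across iterations: the token just chosen and the growing lists of Keys and Values. Everything else is recomputed inside the step.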

The KV substrate itself is passive — it stores the compiled contextual information in the form of Key and Value vectors but performs no computation on its own. All the active work happens in the attention reads and transformer layers. The final projection step is what turns the model’s rich internal representation into actual fluent text choices.

Note on optimizations: Even when modern systems apply compression, eviction, or approximation techniques (such as SnapKV or quantization), they are still operating on a form of the KV substrate — just in a more compact or lossy version. The core concept remains the same: a queryable representation of the accumulated context.
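To illustrate the "compact or lossy" point, here is a minimal sketch of symmetric int8 quantization applied to a single Key vector. It is a generic textbook scheme chosen for simplicity, not the method of any specific inference system, and the vector values are made up. Eviction-style techniques such as SnapKV work differently and are not shown.

```python
def quantize_int8(vec):
    # Symmetric per-vector quantization: map floats to integers in [-127, 127].
    # The "or 1.0" guards against division by zero for an all-zero vector.
    scale = max(abs(x) for x in vec) / 127 or 1.0
    return [round(x / scale) for x in vec], scale

def dequantize(qvec, scale):
    return [q * scale for q in qvec]

key = [0.82, -1.27, 0.05, 0.40]           # made-up Key vector
q, scale = quantize_int8(key)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(key, restored))
print(q)                                  # [82, -127, 5, 40]
print(max_err <= scale / 2 + 1e-12)       # True
```

The restored vector is close to the original but not identical, and attention can still query it in the same way. Storage drops from one float per element to one small integer per element plus a single scale.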

This mechanical process — not magic — is what allows today’s LLMs to produce coherent, context-aware output. The KV substrate is the essential medium that makes the prompt’s intent persistently available throughout generation.

Why “KV substrate”?
Some might prefer calling this simply the “KV sequence,” which is accurate but somewhat undersells its significance. I use “KV substrate” because it better captures its role as the foundational medium on which all contextual understanding and generation depend. The term “cache” implies something optional, a mere performance trick. In truth, the exact same mathematical operations can be performed without any caching at all, by recomputing every Key and Value vector from scratch at each step (though much more slowly). What is truly essential is this underlying structure: the compiled, holistic representation that carries the prompt’s intent forward.
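The claim that caching changes the speed but not the math can be checked numerically. The sketch below computes attention outputs twice for the same toy inputs: once keeping Keys and Values in a cache, and once recomputing them for the whole prefix at every step. The single-head setup, the random weights, and the stand-in input vectors are assumptions made for the example.

```python
import math
import random

random.seed(1)
D = 4   # toy dimension (assumption)

def rand_matrix(r, c):
    return [[random.uniform(-1, 1) for _ in range(c)] for _ in range(r)]

def vec_mat(vec, mat):
    return [sum(v * mat[i][j] for i, v in enumerate(vec))
            for j in range(len(mat[0]))]

def attend(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(D)]

W_q, W_k, W_v = (rand_matrix(D, D) for _ in range(3))
inputs = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]

# With caching: each token's K and V are computed once and kept.
keys, values, cached_out = [], [], []
for x in inputs:
    keys.append(vec_mat(x, W_k))
    values.append(vec_mat(x, W_v))
    cached_out.append(attend(vec_mat(x, W_q), keys, values))

# Without caching: K and V for the whole prefix are recomputed at every step.
uncached_out = []
for t in range(len(inputs)):
    prefix = inputs[: t + 1]
    ks = [vec_mat(x, W_k) for x in prefix]
    vs = [vec_mat(x, W_v) for x in prefix]
    uncached_out.append(attend(vec_mat(inputs[t], W_q), ks, vs))

print(cached_out == uncached_out)   # True
```

The uncached version does work that grows with the square of the sequence length just to rebuild the Keys and Values, while the cached version computes each pair once.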


This article is a companion to the main series on Understanding in LLMs.
It focuses on the actual mechanics of inference without anthropomorphism or over-simplification.