The Two Operations: How the KV Substrate Grows and Compresses

In the previous chapters we saw how the prefill phase builds the KV substrate — the rich collection of Key and Value vectors that carries the model's compiled understanding forward — and how the decode phase generates tokens one at a time by reading from that substrate. We also saw that prompt templates and Chain-of-Thought techniques can shape the substrate deliberately, improving the quality and reliability of the output.

This raises a simpler, more fundamental question: what happens to the KV substrate over the course of an actual conversation?

The answer involves just two operations: expansion and reduction. Understanding these two operations — and recognizing that you already perform both of them intuitively — is the key to understanding why conversations with LLMs behave the way they do.

Expansion: The Substrate Grows

Every time you send a message and the model responds, the KV substrate gets larger.

Your message is processed during a new prefill pass. Fresh Key and Value vectors are computed for every token you typed, then appended to the existing substrate. During the decode phase that follows, each token the model generates also gets its own Key and Value vectors appended. By the time the model finishes its response, the substrate has grown by the combined length of your message and the model's reply.

On the next turn, you type again. More tokens. More Key and Value vectors appended. The model responds. More vectors appended. The substrate keeps growing.
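The growth described above can be sketched as simple bookkeeping. This is a toy model, not how an inference engine stores its cache: the `KVSubstrate` class and the `tokenize` helper are hypothetical names, whitespace splitting stands in for a real tokenizer, and each entry records only that a token occupies a position (no actual vectors).

```python
from dataclasses import dataclass, field


@dataclass
class KVSubstrate:
    """Toy stand-in for a KV cache: one entry per token that holds K/V vectors."""
    entries: list = field(default_factory=list)

    def prefill(self, user_tokens):
        # Prefill: every token of the new user message gets its own entry.
        self.entries.extend(("user", t) for t in user_tokens)

    def decode(self, model_tokens):
        # Decode: each generated token is appended one at a time.
        for t in model_tokens:
            self.entries.append(("model", t))

    def __len__(self):
        return len(self.entries)


def tokenize(text):
    # Assumption: whitespace splitting stands in for a real tokenizer.
    return text.split()


substrate = KVSubstrate()
turns = [
    ("What is a KV cache?", "It stores Key and Value vectors for past tokens."),
    ("Why does it grow?", "Each new token appends one more entry."),
]
sizes = []
for user_msg, model_reply in turns:
    substrate.prefill(tokenize(user_msg))
    substrate.decode(tokenize(model_reply))
    sizes.append(len(substrate))

print(sizes)  # substrate size after each completed turn
```

The size only ever goes up: nothing in either phase removes an entry, which is exactly why a second operation is needed later.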

This is why conversations tend to get richer as they progress. The substrate becomes denser with each exchange. When you refer to something you discussed five messages ago, the model can attend to those earlier Key and Value vectors because they are still present in the substrate. The context accumulates. The model's responses become more informed, more connected to the full history of the conversation.
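To make "attending to earlier vectors" concrete, here is a minimal sketch of scaled dot-product attention for a single query over a tiny cache. The two-dimensional vectors and their values are invented for illustration; real models use many attention heads and far larger dimensions, and the function name `attend` is an assumption of this sketch.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weight-averaged Value vector.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical cache: position 0 comes from an old turn, positions 1-2 are recent.
cached_keys = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cached_values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# A new query that resembles the old key still finds it, because it is still cached.
weights, output = attend([4.0, 0.0], cached_keys, cached_values)
print([round(w, 3) for w in weights])
```

Almost all of the attention weight lands on position 0, the oldest entry. If that entry had been removed from the cache, there would be nothing for the query to match.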

This is also why a well-structured conversation, where each turn builds meaningfully on the last, tends to produce better results than a series of disconnected questions. Each turn adds context to the substrate that future turns can reference.

Reduction: The Substrate Compresses

The KV substrate cannot grow forever. Every model has a finite context window: a maximum number of tokens it can hold in its substrate at one time. At the time of writing, many frontier models support somewhere between 128,000 and 200,000 tokens, and a few advertise a million or more. This sounds like a lot, but in a long, detailed conversation it fills up faster than most people expect.
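A rough back-of-the-envelope calculation shows how quickly a window can fill. The per-turn token counts and the reserved amount below are illustrative assumptions, not measurements, and `turns_until_full` is a hypothetical helper.

```python
def turns_until_full(context_window, tokens_per_turn, reserved=0):
    """Number of complete turns that fit before the window is exhausted.

    `reserved` covers tokens that are always present, such as a system prompt.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    usable = context_window - reserved
    return max(usable, 0) // tokens_per_turn


# Assumed workloads: a short chat turn versus a turn that pastes in a long file.
chat_turns = turns_until_full(128_000, tokens_per_turn=800)
code_turns = turns_until_full(128_000, tokens_per_turn=8_000, reserved=8_000)
print(chat_turns, code_turns)
```

Under these assumptions, a conversation that exchanges large files each turn runs out of room roughly ten times sooner than a casual chat.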

When the substrate approaches its limit, something has to give. This is where reduction comes in.

The simplest and most familiar form of reduction is one you already practice without thinking about it: starting a new AI chat session.

This is something humans do with each other all the time. In the middle of a long, complicated explanation that has gone off track, someone will stop and say "forget everything I just said" — then take a breath and start over with a clearer, more focused attempt. They are not erasing their own understanding. They are asking the listener to discard the muddled context so they can rebuild it more cleanly from scratch. The knowledge is still there. The delivery is what gets reset.

Starting a fresh chat session with an LLM is the same instinct. The model's KV substrate is emptied. None of the context from your previous conversation is present. The substrate has been reduced to zero — the most aggressive reduction possible.

But here is the subtle and important part: your own context has not been reduced to zero.

You remember what you discussed in the previous conversation. You remember the key ideas, the important conclusions, the things that worked and the things that didn't. When you type your first message in the new session, you are drawing on your own accumulated understanding — your own mental substrate — to craft that opening prompt. You are performing a human reduction operation: compressing everything you learned from the previous conversation into a compact new prompt that carries the essential intent forward.

This is worth pausing on. Every time you start a fresh conversation with an LLM, you are unconsciously performing a reduction of the prior context. You are deciding what matters enough to carry forward and what can be left behind. The quality of your opening prompt in a new session is directly shaped by the richness of your own understanding — by everything you absorbed from previous conversations, from your own experience, and from your own thinking.

Your KV substrate — the lived accumulation of everything you know and have experienced — is what generates the tokens you type into that empty text field.

The Spectrum of Reduction

Starting a new AI chat session is the most aggressive form of reduction: the model's substrate goes to zero, and you manually reconstruct the essential context from your own memory. But there are less aggressive forms of reduction that happen automatically.

Many modern chat systems perform compaction: an automated reduction that summarizes the conversation history when the context window approaches its limit. Instead of losing everything, the system attempts to preserve the essential information in a compressed form while discarding surface details and redundant exchanges.

These automated reduction methods sit on a spectrum:

At one end is crude truncation — simply cutting off the oldest messages when the window fills up. This is fast but lossy. Important context from early in the conversation can vanish without warning.

At the other end is intelligent compaction — summarizing the conversation in a way that preserves the essential relationships, key decisions, and important context while discarding the back-and-forth mechanics of how those conclusions were reached. This is more expensive to perform but preserves far more of what matters.
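The crude end of the spectrum is easy to sketch. The example below assumes a conversation stored as a list of message strings and uses a whitespace word count in place of a real tokenizer; `truncate_oldest` is a hypothetical name, not any particular system's API.

```python
def count_tokens(message):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(message.split())


def truncate_oldest(messages, max_tokens):
    """Drop messages from the front until the remainder fits in max_tokens."""
    kept = list(messages)
    total = sum(count_tokens(m) for m in kept)
    while kept and total > max_tokens:
        dropped = kept.pop(0)  # oldest first, regardless of importance
        total -= count_tokens(dropped)
    return kept


history = [
    "The deadline is Friday and the budget is fixed",
    "Here is a first draft of the plan",
    "Please shorten the second paragraph",
    "Done, the second paragraph is shorter now",
]
window = truncate_oldest(history, max_tokens=20)
print(window)
```

The first message, which holds the deadline and the budget, is the one that disappears. Truncation has no notion of importance; it only knows age.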

The ideal reduction would preserve everything essential while using the minimum possible number of tokens. In practice, every reduction method involves some loss. The question is always how much fidelity is sacrificed and whether the surviving representation is still rich enough to support coherent continuation.
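Compaction can be sketched in the same style. Here the older part of the history is replaced by a single summary message while the most recent messages are kept verbatim. The `compact` function and its parameters are hypothetical, and `toy_summarize` is a deliberately crude placeholder: in a real system the summary would most likely come from another model call.

```python
def count_tokens(text):
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def compact(messages, max_tokens, keep_recent, summarize):
    """Replace older messages with one summary when the history exceeds max_tokens.

    `summarize` is supplied by the caller; this sketch does not prescribe
    how the summary is produced or guarantee that the result fits.
    """
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return list(messages)
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    return ["Summary: " + summarize(older)] + recent


def toy_summarize(older):
    # Hypothetical placeholder: keep the first four words of each message.
    return "; ".join(" ".join(m.split()[:4]) for m in older)


history = [
    "The deadline is Friday and the budget is fixed",
    "Here is a first draft of the plan",
    "Please shorten the second paragraph",
    "Done, the second paragraph is shorter now",
]
compacted = compact(history, max_tokens=24, keep_recent=2, summarize=toy_summarize)
before = sum(count_tokens(m) for m in history)
after = sum(count_tokens(m) for m in compacted)
print(before, after)
print(compacted[0])
```

Even this crude summarizer keeps the deadline in the substrate while shrinking the history, but it silently loses the budget constraint. That is the fidelity trade-off in miniature: how much survives depends entirely on the quality of the summarizer.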

You Already Know This

If you have ever had a long conversation with an LLM that started producing vague, repetitive, or disconnected responses, you have experienced the consequences of a substrate that has been reduced poorly or has grown so large that the model struggles to attend to the most relevant parts.

If you have ever started a fresh conversation and found that a carefully crafted opening prompt produced better results than continuing a muddled old thread, you have experienced the power of human reduction. You compressed the essential insight into a clean seed and gave the model a fresh, high-quality substrate to work with.

If you have ever noticed that the quality of a conversation depends heavily on what you say in the first few messages, you have experienced the outsized importance of early tokens in the substrate. Those first tokens carry strong contextual signals that influence everything that follows.

These are not advanced technical insights. They are things every regular user of LLMs has felt. The expansion and reduction framework simply gives those experiences a name and a mechanical explanation.

The Human Side of the Loop

There is one more thing worth making explicit, because it is easy to overlook.

When you sit down to start a conversation — especially a fresh one with an empty context window — the prompt you type is not coming from nowhere. It is coming from you. From your experience, your knowledge, your curiosity, your confusion, your specific way of seeing the world.

You are the upstream substrate. Your lived experience is the rich, accumulated context from which your prompts are generated. The model's KV substrate is built from your tokens. Your tokens are built from your understanding. The quality of the entire process — expansion, reduction, and everything the model produces — traces back to what you bring to the conversation.

The two operations — expansion and reduction — are mechanical. But what drives them is human. The model expands what you give it. You reduce what matters. The loop between these two operations is the conversation itself.

In the next chapter, we will look at what happens when the model's output is no longer just text in a chat, but working files that expand a prompt into something you can actually run.