// term 04 · Memory & Context

Context Window

Operational Memory Limit

The maximum number of tokens a model can attend to in a single request — its entire working memory. Everything the model knows about your task right now must fit inside this window: instructions, conversation history, retrieved documents, and the response it is generating.

ContextAttentionKV CacheLong Context

// Range

8K–1M+

Tokens across current production models. A 200K window holds roughly a 500-page book; 1M approaches a small codebase.

// Scaling

O(n²)

Attention compute grows quadratically with sequence length in standard transformers — the structural reason long context costs more and runs slower.

// Recall

−20–40%

Typical accuracy degradation for facts buried mid-context in very long prompts — the documented “lost in the middle” effect.

// full definition

What Context Window actually is

The context window is the model's only working memory. LLMs are stateless between requests: nothing persists from one call to the next unless the application re-sends it or retrieves it from external storage. The continuity users perceive in a long conversation is engineered — accumulated history replayed into each request — not remembered by the model.

The window is a budget under constant pressure. System prompts and tool definitions are a fixed tax on every call; conversation history grows linearly; retrieved documents arrive in bulk; and the response itself must fit in whatever remains. Long sessions eventually evict their own beginnings. Context assembly — deciding what makes it into each request — is where most quality wins and most regressions originate.

Bigger windows are not free. Attention compute scales quadratically with sequence length, per-token serving memory (the KV cache) scales with context times concurrency, and input tokens are billed on every request — resending a 100K-token context across a 20-turn conversation burns two million tokens before any output. Worse, recall is uneven: models retrieve facts placed at the start and end of long contexts far more reliably than material buried in the middle.

The strategic consequence: long context and retrieval are complements, not rivals. Million-token windows expand what is possible — whole-codebase reasoning, book-length analysis — but production economics still favor retrieving precisely and spending the window on what matters. Teams that treat context as an engineered, budgeted resource consistently outperform teams that treat it as free real estate.

// how it works

How the window gets spent

Every request is a zero-sum allocation of a fixed token budget — understanding the line items is the foundation of context engineering.

System Prompt

Persona, policies, and tool definitions are prepended to every call — a fixed tax on the budget before any work begins.

Conversation History

Prior turns accumulate linearly. Without summarization or pruning, long sessions eventually evict their own beginnings.

Retrieved Context

RAG pipelines inject documents at request time — typically the largest and most variable line item in enterprise workloads.

Attention Pass

Every token attends to every other token in the window. This is where the quadratic compute bill is paid.

KV Cache

Per-token attention state is held in GPU memory throughout generation — long contexts consume serving memory as well as compute.

Output Budget

The response shares the same window. A maxed-out input leaves no room to answer — output reservation is part of the design.

// anatomy

The components teams must understand

Window Size

Hard architectural ceiling

Fixed per model at training and serving time. Exceeding it does not degrade gracefully — content is truncated or the request fails outright.

Positional Encoding Range

Why the limit exists

Models learn token positions up to a trained maximum. Extension techniques like RoPE scaling stretch it, but quality at extreme lengths varies sharply by model.

Effective Context

Usable vs advertised

Benchmark recall at the advertised limit often trails the headline number. Evaluate the effective window on your task, not the marketing figure.

KV Cache

The serving-memory bill

Attention state scales with context length × concurrent users. Long-context features carry real infrastructure costs that surface in pricing.

Context Assembly

The orchestration layer

Application code deciding what enters each request — history, retrievals, tools. The highest-leverage and least-visible layer of system quality.

Compression & Caching

Stretching the budget

Rolling summaries, selective retention, and prompt caching extend effective memory and cut re-send costs beyond what the raw window allows.

// strategic implications

What this changes for the business

01 · Architecture

Long context and RAG are complements, not rivals

Million-token windows do not eliminate retrieval: stuffing everything into context costs more, runs slower, and degrades mid-context recall. The winning pattern retrieves precisely and spends the window on what matters. Architect for selective context regardless of window size — the economics reward precision at every scale.

02 · Economics

Window usage is a direct cost dial

Input tokens are billed on every request — a 100K-token context resent across a 20-turn conversation consumes two million tokens before any output. Prompt caching, history summarization, and per-feature context budgets are margin levers worth real money at production volume.

03 · Product

Session memory is a designed feature

Models remember nothing between requests. The continuity users perceive is engineered: persisted state, summarized history, retrieved memory. Teams that treat memory as product infrastructure ship better assistants than teams that lean on raw window size — and they do it at lower cost.

// common misconceptions

What Context Window is not

Myth

“The model remembers our previous conversations.”

Reality

Each request is stateless. Anything “remembered” was re-sent in the window or retrieved from external storage by the application layer. Memory is a system you build, not a property the model has.

Myth

“A big enough window means we can skip retrieval engineering.”

Reality

Long context shifts the retrieval bar; it does not remove it. Cost, latency, and mid-context recall degradation all penalize indiscriminate stuffing — precision retrieval keeps winning in production on every axis that matters.

Myth

“All positions in the window perform equally.”

Reality

Recall is strongest at the start and end of long contexts and measurably weaker in the middle. Placement of critical instructions and facts is a real engineering variable — and a free quality win once you know it exists.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.

AI innovation, applied

Context Window

What Context Window actually is

How the window gets spent

The components teams must understand

What this changes for the business

What Context Window is not

Explore the wider architecture

Know the term. Now build the strategy.