// term 04 · Memory & Context
Context Window
Operational Memory Limit
The maximum number of tokens a model can attend to in a single request — its entire working memory. Everything the model knows about your task right now must fit inside this window: instructions, conversation history, retrieved documents, and the response it is generating.
// Range
8K–1M+
Tokens across current production models. A 200K window holds roughly a 500-page book; 1M approaches a small codebase.
// Scaling
O(n²)
Attention compute grows quadratically with sequence length in standard transformers — the structural reason long context costs more and runs slower.
// Recall
−20–40%
Typical accuracy degradation for facts buried mid-context in very long prompts — the documented “lost in the middle” effect.
// full definition
What Context Window actually is
The context window is the model's only working memory. LLMs are stateless between requests: nothing persists from one call to the next unless the application re-sends it or retrieves it from external storage. The continuity users perceive in a long conversation is engineered — accumulated history replayed into each request — not remembered by the model.
The window is a budget under constant pressure. System prompts and tool definitions are a fixed tax on every call; conversation history grows linearly; retrieved documents arrive in bulk; and the response itself must fit in whatever remains. Long sessions eventually evict their own beginnings. Context assembly — deciding what makes it into each request — is where most quality wins and most regressions originate.
Bigger windows are not free. Attention compute scales quadratically with sequence length, per-token serving memory (the KV cache) scales with context times concurrency, and input tokens are billed on every request — resending a 100K-token context across a 20-turn conversation burns two million tokens before any output. Worse, recall is uneven: models retrieve facts placed at the start and end of long contexts far more reliably than material buried in the middle.
The strategic consequence: long context and retrieval are complements, not rivals. Million-token windows expand what is possible — whole-codebase reasoning, book-length analysis — but production economics still favor retrieving precisely and spending the window on what matters. Teams that treat context as an engineered, budgeted resource consistently outperform teams that treat it as free real estate.
// how it works
How the window gets spent
Every request is a zero-sum allocation of a fixed token budget — understanding the line items is the foundation of context engineering.
System Prompt
Persona, policies, and tool definitions are prepended to every call — a fixed tax on the budget before any work begins.
Conversation History
Prior turns accumulate linearly. Without summarization or pruning, long sessions eventually evict their own beginnings.
Retrieved Context
RAG pipelines inject documents at request time — typically the largest and most variable line item in enterprise workloads.
Attention Pass
Every token attends to every other token in the window. This is where the quadratic compute bill is paid.
KV Cache
Per-token attention state is held in GPU memory throughout generation — long contexts consume serving memory as well as compute.
Output Budget
The response shares the same window. A maxed-out input leaves no room to answer — output reservation is part of the design.
// anatomy
The components teams must understand
01
Window Size
Hard architectural ceiling
Fixed per model at training and serving time. Exceeding it does not degrade gracefully — content is truncated or the request fails outright.
02
Positional Encoding Range
Why the limit exists
Models learn token positions up to a trained maximum. Extension techniques like RoPE scaling stretch it, but quality at extreme lengths varies sharply by model.
03
Effective Context
Usable vs advertised
Benchmark recall at the advertised limit often trails the headline number. Evaluate the effective window on your task, not the marketing figure.
04
KV Cache
The serving-memory bill
Attention state scales with context length × concurrent users. Long-context features carry real infrastructure costs that surface in pricing.
05
Context Assembly
The orchestration layer
Application code deciding what enters each request — history, retrievals, tools. The highest-leverage and least-visible layer of system quality.
06
Compression & Caching
Stretching the budget
Rolling summaries, selective retention, and prompt caching extend effective memory and cut re-send costs beyond what the raw window allows.
// strategic implications
What this changes for the business
01 · Architecture
Long context and RAG are complements, not rivals
Million-token windows do not eliminate retrieval: stuffing everything into context costs more, runs slower, and degrades mid-context recall. The winning pattern retrieves precisely and spends the window on what matters. Architect for selective context regardless of window size — the economics reward precision at every scale.
02 · Economics
Window usage is a direct cost dial
Input tokens are billed on every request — a 100K-token context resent across a 20-turn conversation consumes two million tokens before any output. Prompt caching, history summarization, and per-feature context budgets are margin levers worth real money at production volume.
03 · Product
Session memory is a designed feature
Models remember nothing between requests. The continuity users perceive is engineered: persisted state, summarized history, retrieved memory. Teams that treat memory as product infrastructure ship better assistants than teams that lean on raw window size — and they do it at lower cost.
// common misconceptions
What Context Window is not
Myth
“The model remembers our previous conversations.”
Reality
Each request is stateless. Anything “remembered” was re-sent in the window or retrieved from external storage by the application layer. Memory is a system you build, not a property the model has.
Myth
“A big enough window means we can skip retrieval engineering.”
Reality
Long context shifts the retrieval bar; it does not remove it. Cost, latency, and mid-context recall degradation all penalize indiscriminate stuffing — precision retrieval keeps winning in production on every axis that matters.
Myth
“All positions in the window perform equally.”
Reality
Recall is strongest at the start and end of long contexts and measurably weaker in the middle. Placement of critical instructions and facts is a real engineering variable — and a free quality win once you know it exists.
// from literacy to leverage
Know the term. Now build the strategy.
Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.