# Context Window — Operational Memory Limit

> The maximum number of tokens a model can attend to in a single request — its entire working memory. Everything the model knows about your task right now must fit inside this window: instructions, conversation history, retrieved documents, and the response it is generating.

**Canonical URL:** https://www.andekian.com/ai-lexicon/context-window  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 04 of 100** · Memory & Context  
**Tags:** Context, Attention, KV Cache, Long Context

## Key Stats

- **Range — 8K–1M+:** Tokens across current production models. A 200K window holds roughly a 500-page book; 1M approaches a small codebase.
- **Scaling — O(n²):** Attention compute grows quadratically with sequence length in standard transformers — the structural reason long context costs more and runs slower.
- **Recall — −20–40%:** Typical accuracy degradation for facts buried mid-context in very long prompts — the documented “lost in the middle” effect.

## What Context Window Actually Is

The context window is the model's only working memory. LLMs are stateless between requests: nothing persists from one call to the next unless the application re-sends it or retrieves it from external storage. The continuity users perceive in a long conversation is engineered — accumulated history replayed into each request — not remembered by the model.

The window is a budget under constant pressure. System prompts and tool definitions are a fixed tax on every call; conversation history grows linearly; retrieved documents arrive in bulk; and the response itself must fit in whatever remains. Long sessions eventually evict their own beginnings. Context assembly — deciding what makes it into each request — is where most quality wins and most regressions originate.

Bigger windows are not free. Attention compute scales quadratically with sequence length, per-token serving memory (the KV cache) scales with context times concurrency, and input tokens are billed on every request — resending a 100K-token context across a 20-turn conversation burns two million tokens before any output. Worse, recall is uneven: models retrieve facts placed at the start and end of long contexts far more reliably than material buried in the middle.

The strategic consequence: long context and retrieval are complements, not rivals. Million-token windows expand what is possible — whole-codebase reasoning, book-length analysis — but production economics still favor retrieving precisely and spending the window on what matters. Teams that treat context as an engineered, budgeted resource consistently outperform teams that treat it as free real estate.

## How It Works: How the window gets spent

Every request is a zero-sum allocation of a fixed token budget — understanding the line items is the foundation of context engineering.

1. **System Prompt** — Persona, policies, and tool definitions are prepended to every call — a fixed tax on the budget before any work begins.
2. **Conversation History** — Prior turns accumulate linearly. Without summarization or pruning, long sessions eventually evict their own beginnings.
3. **Retrieved Context** — RAG pipelines inject documents at request time — typically the largest and most variable line item in enterprise workloads.
4. **Attention Pass** — Every token attends to every other token in the window. This is where the quadratic compute bill is paid.
5. **KV Cache** — Per-token attention state is held in GPU memory throughout generation — long contexts consume serving memory as well as compute.
6. **Output Budget** — The response shares the same window. A maxed-out input leaves no room to answer — output reservation is part of the design.

## Anatomy: The Components Teams Must Understand

- **Window Size** (Hard architectural ceiling): Fixed per model at training and serving time. Exceeding it does not degrade gracefully — content is truncated or the request fails outright.
- **Positional Encoding Range** (Why the limit exists): Models learn token positions up to a trained maximum. Extension techniques like RoPE scaling stretch it, but quality at extreme lengths varies sharply by model.
- **Effective Context** (Usable vs advertised): Benchmark recall at the advertised limit often trails the headline number. Evaluate the effective window on your task, not the marketing figure.
- **KV Cache** (The serving-memory bill): Attention state scales with context length × concurrent users. Long-context features carry real infrastructure costs that surface in pricing.
- **Context Assembly** (The orchestration layer): Application code deciding what enters each request — history, retrievals, tools. The highest-leverage and least-visible layer of system quality.
- **Compression & Caching** (Stretching the budget): Rolling summaries, selective retention, and prompt caching extend effective memory and cut re-send costs beyond what the raw window allows.

## Strategic Implications

- **Long context and RAG are complements, not rivals** (01 · Architecture): Million-token windows do not eliminate retrieval: stuffing everything into context costs more, runs slower, and degrades mid-context recall. The winning pattern retrieves precisely and spends the window on what matters. Architect for selective context regardless of window size — the economics reward precision at every scale.
- **Window usage is a direct cost dial** (02 · Economics): Input tokens are billed on every request — a 100K-token context resent across a 20-turn conversation consumes two million tokens before any output. Prompt caching, history summarization, and per-feature context budgets are margin levers worth real money at production volume.
- **Session memory is a designed feature** (03 · Product): Models remember nothing between requests. The continuity users perceive is engineered: persisted state, summarized history, retrieved memory. Teams that treat memory as product infrastructure ship better assistants than teams that lean on raw window size — and they do it at lower cost.

## Common Misconceptions

- **Myth:** “The model remembers our previous conversations.”  
  **Reality:** Each request is stateless. Anything “remembered” was re-sent in the window or retrieved from external storage by the application layer. Memory is a system you build, not a property the model has.
- **Myth:** “A big enough window means we can skip retrieval engineering.”  
  **Reality:** Long context shifts the retrieval bar; it does not remove it. Cost, latency, and mid-context recall degradation all penalize indiscriminate stuffing — precision retrieval keeps winning in production on every axis that matters.
- **Myth:** “All positions in the window perform equally.”  
  **Reality:** Recall is strongest at the start and end of long contexts and measurably weaker in the middle. Placement of critical instructions and facts is a real engineering variable — and a free quality win once you know it exists.

## Related Terms

- [LLM — Large Language Model](https://www.andekian.com/ai-lexicon/llm)
- [Token — Unit Of AI Processing](https://www.andekian.com/ai-lexicon/token)
- [RAG — Retrieval-Augmented Generation](https://www.andekian.com/ai-lexicon/rag)
- [Chunking — Document Segmentation Process](https://www.andekian.com/ai-lexicon/chunking)
- [Context Injection — Dynamic Information Insertion](https://www.andekian.com/ai-lexicon/context-injection)
- [Context Compression — Smaller Context Footprint](https://www.andekian.com/ai-lexicon/context-compression)
- [Long-Term Memory — Persistent Contextual Storage](https://www.andekian.com/ai-lexicon/long-term-memory)
- [Short-Term Memory — Active Session Awareness](https://www.andekian.com/ai-lexicon/short-term-memory)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon

## Contact

Book a conversation or send an inquiry: https://www.andekian.com/#contact
LinkedIn: https://www.linkedin.com/in/andekian/