// term 65 · Memory & Context
Context Compression
Smaller Context Footprint
Reducing the token count of context while preserving what matters — summarizing histories, pruning irrelevance, and condensing documents so more meaning fits in less window. Compression is how systems stretch fixed context budgets across long sessions and large knowledge at sustainable cost.
// Typical ratio
5–10x
Token reduction achievable on conversational history and verbose documents before task-relevant fidelity degrades.
// Double payoff
cost + focus
Fewer tokens cut spend directly — and trimmed context often improves accuracy by removing distraction.
// Risk
lossy
Compression discards by design — the failure mode is dropping the one detail the next request needed.
// full definition
What Context Compression actually is
Context is a metered resource: every token in the window costs money per request, adds latency, and competes for the model's attention. Compression is the discipline of spending fewer tokens on the same meaning — summarizing what's settled, pruning what's irrelevant, condensing what's verbose — so that long conversations, large documents, and rich agent state fit inside budgets that fixed windows and finite wallets impose.
The standard techniques form a toolkit. Rolling summarization replaces aging conversation turns with condensed records, keeping recent exchanges verbatim while history shrinks to its decisions and facts. Relevance pruning drops retrieved passages that don't bear on the current question. Extractive compression keeps key sentences and discards connective tissue; learned compressors go further, trimming tokens that contribute little to model predictions. Structured-state designs sidestep prose entirely — distilling session state into compact facts-and-decisions records rather than narrative.
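As a minimal sketch of the rolling summarization described above: recent turns stay verbatim while older ones fold into a running summary. The names `RollingContext`, `naive_summarize`, and `count_tokens` are hypothetical, the summarizer is a crude first-sentence placeholder for a real model call, and whitespace-separated words stand in for tokens.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def naive_summarize(parts: List[str]) -> str:
    # Hypothetical placeholder for a real summarizer (for example an LLM call):
    # keeps only the text before the first period of each part.
    firsts = [p.split(".")[0].strip() for p in parts]
    return " | ".join(f for f in firsts if f)


@dataclass
class RollingContext:
    keep_recent: int = 2                                  # turns kept verbatim
    summarize: Callable[[List[str]], str] = naive_summarize
    summary: str = ""                                     # condensed older history
    recent: List[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        overflow = len(self.recent) - self.keep_recent
        if overflow > 0:
            aged = self.recent[:overflow]
            self.recent = self.recent[overflow:]
            # Fold the previous summary and the aged turns into a new summary.
            parts = ([self.summary] if self.summary else []) + aged
            self.summary = self.summarize(parts)

    def render(self) -> str:
        header = [f"[summary] {self.summary}"] if self.summary else []
        return "\n".join(header + self.recent)
```

Feeding four turns into `RollingContext(keep_recent=2)` leaves the last two turns untouched and reduces the first two to their opening sentences. Anything in the dropped sentences is gone from the window, which is the lossy behavior the risk discussion warns about.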
Compression's quiet second benefit is focus. Long, cluttered contexts measurably degrade model performance: relevant facts buried mid-window get missed, and irrelevant material invites distraction. Trimmed context often improves answer quality while cutting its price, because there is less to attend to and more attention on what remains. The cost-quality relationship is not a pure trade; for verbose inputs compressed moderately, compression can win on both axes at once.
The engineering risk is what lossy means: discarded detail is gone, and the failure mode is needing it later — a summarized caveat, a pruned figure, the one turn that contextualized everything after it. Mature systems compress conservatively where stakes are high, keep originals retrievable (compress the context, not the archive), and evaluate compression against downstream task performance rather than summary aesthetics. The question is never whether the summary reads well — it's whether the system still answers correctly after compression did its work.
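One way to picture "compress the context, not the archive" is sketched below. It assumes a hypothetical `ArchivedContext` class, an in-memory dictionary as the archive, and plain truncation as the stand-in compressor; a real system would use a proper summarizer and durable storage.

```python
from typing import Dict, List, Tuple


class ArchivedContext:
    """Keeps full originals outside the window; the window holds short stubs."""

    def __init__(self, max_stub_words: int = 8) -> None:
        self.max_stub_words = max_stub_words
        self.archive: Dict[int, str] = {}         # id -> original text
        self.window: List[Tuple[int, str]] = []   # (id, compressed stub)

    def add(self, text: str) -> int:
        entry_id = len(self.archive)
        self.archive[entry_id] = text             # the original is never discarded
        words = text.split()
        # Assumption: truncation stands in for a real compressor.
        stub = " ".join(words[: self.max_stub_words])
        if len(words) > self.max_stub_words:
            stub += " ..."
        self.window.append((entry_id, stub))
        return entry_id

    def render_window(self) -> str:
        # Each stub carries its archive id so the detail can be requested later.
        return "\n".join(f"[{i}] {stub}" for i, stub in self.window)

    def expand(self, entry_id: int) -> str:
        # Recover the original when a compressed detail turns out to matter.
        return self.archive[entry_id]
```

The stub for a long entry may lose a trailing caveat, but `expand` can still return it, so a wrong compression decision costs a lookup rather than the information itself.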
// how it works
Fitting more meaning into fewer tokens
Compression triages context — what stays verbatim, what survives as summary, what drops — under a token budget that never stops applying.
Budget Pressure
Accumulating history or bulky retrieval approaches the window and cost limits — compression's trigger condition.
Triage
Content is classified by current relevance and recency as verbatim-critical, summarizable, or droppable.
Condensation
Summarizers and extractors compress the middle tier — meaning preserved, token count collapsed.
Assembly
Verbatim recents, compressed history, and pruned retrieval compose into the working context — the budget met.
Archive Retention
Originals persist outside the window — retrievable when a compressed detail turns out to matter after all.
Task-Level Evaluation
Downstream answer quality measures the scheme — fidelity judged by outcomes, not by how the summary reads.
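The triage and assembly steps above can be sketched as follows. Everything here is illustrative: the `Item` fields, the thresholds, and the truncating `condense` function are assumptions, relevance scores are taken as given from some upstream scorer, and word counts stand in for tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    text: str
    age: int          # turns since the item was added; 0 is the newest
    relevance: float  # 0.0 to 1.0, assumed to come from an upstream scorer


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def condense(text: str, max_words: int = 6) -> str:
    # Hypothetical stand-in for a summarizer or extractor.
    return " ".join(text.split()[:max_words])


def triage(item: Item, recent_age: int = 1, min_relevance: float = 0.3) -> str:
    if item.age <= recent_age:
        return "verbatim"
    if item.relevance >= min_relevance:
        return "summarize"
    return "drop"


def assemble(items: List[Item], budget: int) -> List[str]:
    context: List[str] = []
    used = 0
    # Newest first, so recent turns claim budget before older material.
    for item in sorted(items, key=lambda i: i.age):
        tier = triage(item)
        if tier == "drop":
            continue
        text = item.text if tier == "verbatim" else condense(item.text)
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # skip anything that does not fit the remaining budget
        context.append(text)
        used += cost
    context.reverse()  # restore chronological order, oldest first
    return context
```

Filling the budget newest-first is one design choice among several; a system that values old decisions over recent chatter would rank by relevance instead.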
// anatomy
The components teams must understand
01
Rolling Summaries
History, condensed
Aging turns replaced by decision-and-fact records — the standard mechanic keeping long sessions inside fixed windows.
02
Relevance Pruning
Dropping the inert
Retrieved and historical content scored against the current task — what doesn't bear on it doesn't ride along.
03
Extractive Selection
Key sentences survive
High-signal spans kept verbatim, connective tissue cut — compression without paraphrase risk.
04
Learned Compressors
Model-aware trimming
Token-level importance scoring trained against model behavior — squeezing windows further than heuristics reach.
05
Structured State
Facts over narrative
Session memory as compact records — entities, decisions, open questions — rather than re-summarized prose.
06
Fidelity Evaluation
The outcome test
Task accuracy after compression versus before — the metric that catches schemes optimizing readability over utility.
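To make the outcome test concrete, here is a toy harness that scores a relevance-pruning compressor on task accuracy before and after compression. It is a sketch under loud assumptions: `toy_answerer` replaces a real model call and grader by checking whether the expected answer text survives in the context, the word-overlap scorer is a simple heuristic, and the two cases are invented for illustration.

```python
from typing import Callable, List, Tuple

Case = Tuple[List[str], str, str]  # (passages, question, expected answer)


def tokenize(text: str) -> set:
    return {w.strip(".,?!:").lower() for w in text.split()}


def prune_by_overlap(passages: List[str], question: str, keep: int = 1) -> List[str]:
    # Relevance pruning: rank passages by words shared with the question.
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(tokenize(p) & q), reverse=True)
    return ranked[:keep]


def toy_answerer(context: List[str], expected: str) -> bool:
    # Assumption: stands in for a model call plus grading. An answer counts
    # as correct only if its text is still present in the context.
    return any(expected.lower() in p.lower() for p in context)


def evaluate(cases: List[Case], compress: Callable[[List[str], str], List[str]]) -> dict:
    full_hits = comp_hits = full_tokens = comp_tokens = 0
    for passages, question, expected in cases:
        compressed = compress(passages, question)
        full_hits += toy_answerer(passages, expected)
        comp_hits += toy_answerer(compressed, expected)
        full_tokens += sum(len(p.split()) for p in passages)
        comp_tokens += sum(len(p.split()) for p in compressed)
    n = len(cases)
    return {
        "accuracy_full": full_hits / n,
        "accuracy_compressed": comp_hits / n,
        "token_ratio": full_tokens / comp_tokens,
    }


CASES: List[Case] = [
    (["The invoice total is 420 dollars.", "The office closes at six."],
     "What is the invoice total?", "420"),
    (["The warranty lasts two years and the warranty covers parts.",
      "Coverage begins on the delivery date."],
     "When does the warranty start?", "delivery date"),
]

report = evaluate(CASES, prune_by_overlap)
```

On these two cases the full context answers both questions, while the pruned context answers only the first: the warranty question shares more words with the passage that lacks the answer, so the passage that holds it is dropped. The compressed output still reads cleanly, which is why the harness measures accuracy rather than readability.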
// strategic implications
What this changes for the business
01 · Economics
Tokens saved are margin earned
Context re-sent on every turn of every session is a multiplied cost. Compression cuts it at the source, often 5–10x on history-heavy workloads. For conversational products at volume, compression strategy is a direct line item in unit economics.
02 · Quality
Leaner context often answers better
Cluttered windows bury signal and invite distraction — trimming measurably improves recall of what remains. Compression isn't purely a cost concession; in verbose regimes it's a quality intervention with a rebate attached.
03 · Risk
Lossy means designed forgetting
Compression discards detail that later requests may need, and the failure surfaces downstream, where it is rarely traced back to the compression step. Keep originals retrievable, compress conservatively where stakes rise, and evaluate schemes on task outcomes rather than summary quality.
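The multiplied cost in the economics point can be checked with simple arithmetic. The sketch below assumes every turn adds a fixed number of tokens to the history and that the whole history is re-sent on each request; compression is modeled crudely as a cap on history size. The function name and all figures are illustrative, not benchmarks.

```python
from typing import Optional


def total_input_tokens(turns: int, tokens_per_turn: int, cap: Optional[int] = None) -> int:
    """Sum the history tokens sent across a session, one request per turn."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn
        if cap is not None:
            # Assumption: compression holds the history at or below `cap` tokens.
            history = min(history, cap)
        total += history
    return total


uncompressed = total_input_tokens(turns=50, tokens_per_turn=200)          # 255000
compressed = total_input_tokens(turns=50, tokens_per_turn=200, cap=2000)  # 91000
```

Without a cap, cumulative input grows quadratically with session length (200 × 50 × 51 / 2 = 255,000 tokens here), while the capped history grows linearly once the cap is reached, so the gap widens as sessions get longer.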
// common misconceptions
What Context Compression is not
Myth
“Bigger context windows make compression unnecessary.”
Reality
Larger windows raise the ceiling without changing the economics: tokens still bill per request, latency still scales, and attention still dilutes. Compression pays at every window size; bigger windows only shift where the payoff lands.
Myth
“A good summary preserves what matters.”
Reality
Summaries preserve what the summarizer judged salient — which may not include the caveat the next question turns on. Fidelity is task-relative and measured downstream, not a property of well-written prose.
Myth
“Compression is just summarization.”
Reality
Summarization is one tool among several — pruning, extraction, learned token trimming, and structured state each compress differently. Production schemes compose them by content type and stakes.