// term 65 · Memory & Context

Context Compression

Smaller Context Footprint

Reducing the token count of context while preserving what matters — summarizing histories, pruning irrelevance, and condensing documents so more meaning fits in less window. Compression is how systems stretch fixed context budgets across long sessions and large knowledge at sustainable cost.

SummarizationToken EfficiencyPruningCost

// Typical ratio

5–10x

Token reduction achievable on conversational history and verbose documents before task-relevant fidelity degrades.

// Double payoff

cost + focus

Fewer tokens cut spend directly — and trimmed context often improves accuracy by removing distraction.

// Risk

lossy

Compression discards by design — the failure mode is dropping the one detail the next request needed.

// full definition

What Context Compression actually is

Context is a metered resource: every token in the window costs money per request, adds latency, and competes for the model's attention. Compression is the discipline of spending fewer tokens on the same meaning — summarizing what's settled, pruning what's irrelevant, condensing what's verbose — so that long conversations, large documents, and rich agent state fit inside budgets that fixed windows and finite wallets impose.

The standard techniques form a toolkit. Rolling summarization replaces aging conversation turns with condensed records, keeping recent exchanges verbatim while history shrinks to its decisions and facts. Relevance pruning drops retrieved passages that don't bear on the current question. Extractive compression keeps key sentences and discards connective tissue; learned compressors go further, trimming tokens that contribute little to model predictions. Structured-state designs sidestep prose entirely — distilling session state into compact facts-and-decisions records rather than narrative.

Compression's quiet second benefit is focus. Long, cluttered contexts measurably degrade model performance — relevant facts buried mid-window get missed, and irrelevant material invites distraction. Trimmed context often improves answer quality while cutting its price: less to attend to, more attention on what remains. The cost-quality relationship is not a pure trade; in the verbose middle ranges, compression wins on both axes simultaneously.

The engineering risk is what lossy means: discarded detail is gone, and the failure mode is needing it later — a summarized caveat, a pruned figure, the one turn that contextualized everything after it. Mature systems compress conservatively where stakes are high, keep originals retrievable (compress the context, not the archive), and evaluate compression against downstream task performance rather than summary aesthetics. The question is never whether the summary reads well — it's whether the system still answers correctly after compression did its work.

// how it works

Fitting more meaning into fewer tokens

Compression triages context — what stays verbatim, what survives as summary, what drops — under a token budget that never stops applying.

Budget Pressure

Accumulating history or bulky retrieval approaches the window and cost limits — compression's trigger condition.

Triage

Content classifies by current relevance and recency — verbatim-critical, summarizable, or droppable.

Condensation

Summarizers and extractors compress the middle tier — meaning preserved, token count collapsed.

Assembly

Verbatim recents, compressed history, and pruned retrieval compose into the working context — the budget met.

Archive Retention

Originals persist outside the window — retrievable when a compressed detail turns out to matter after all.

Task-Level Evaluation

Downstream answer quality measures the scheme — fidelity judged by outcomes, not by how the summary reads.

// anatomy

The components teams must understand

Rolling Summaries

History, condensed

Aging turns replaced by decision-and-fact records — the standard mechanic keeping long sessions inside fixed windows.

Relevance Pruning

Dropping the inert

Retrieved and historical content scored against the current task — what doesn't bear on it doesn't ride along.

Extractive Selection

Key sentences survive

High-signal spans kept verbatim, connective tissue cut — compression without paraphrase risk.

Learned Compressors

Model-aware trimming

Token-level importance scoring trained against model behavior — squeezing windows further than heuristics reach.

Structured State

Facts over narrative

Session memory as compact records — entities, decisions, open questions — rather than re-summarized prose.

Fidelity Evaluation

The outcome test

Task accuracy after compression versus before — the metric that catches schemes optimizing readability over utility.

// strategic implications

What this changes for the business

01 · Economics

Tokens saved are margin earned

Context re-sent across every turn of every session is a multiplied cost — compression cuts it at the source, often 5–10x on history-heavy workloads. For conversational products at volume, compression strategy is a direct line item on unit economics.

02 · Quality

Leaner context often answers better

Cluttered windows bury signal and invite distraction — trimming measurably improves recall of what remains. Compression isn't purely a cost concession; in verbose regimes it's a quality intervention with a rebate attached.

03 · Risk

Lossy means designed forgetting

Compression discards detail that later requests may need — the failure surfaces downstream and blames itself poorly. Keep originals retrievable, compress conservatively where stakes rise, and evaluate schemes on task outcomes rather than summary quality.

// common misconceptions

What Context Compression is not

Myth

“Bigger context windows make compression unnecessary.”

Reality

Larger windows raise the ceiling, not the economics — tokens still bill per request, latency still scales, and attention still dilutes. Compression pays at every window size; bigger windows just move where.

Myth

“A good summary preserves what matters.”

Reality

Summaries preserve what the summarizer judged salient — which may not include the caveat the next question turns on. Fidelity is task-relative and measured downstream, not a property of well-written prose.

Myth

“Compression is just summarization.”

Reality

Summarization is one tool among several — pruning, extraction, learned token trimming, and structured state each compress differently. Production schemes compose them by content type and stakes.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.

AI innovation, applied

Context Compression

What Context Compression actually is

Fitting more meaning into fewer tokens

The components teams must understand

What this changes for the business

What Context Compression is not

Explore the wider architecture

Know the term. Now build the strategy.