// term 59 · Retrieval & Knowledge
Chunking
Document Segmentation Process
Splitting documents into retrieval-sized segments before embedding and indexing — the preprocessing decision that quietly sets the ceiling on RAG quality. Chunk too large and retrieval loses precision; too small and retrieved fragments lose the context that made them meaningful.
// Typical size
200–800 tokens
The working range for most RAG chunks — balanced between embedding focus and contextual completeness.
// Standard aid
10–20% overlap
Adjacent chunks sharing boundary content — insurance against ideas severed mid-thought at split points.
// Verdict
structure wins
Splitting on headings, paragraphs, and semantic boundaries consistently outperforms fixed-size cutting in retrieval evaluations.
// full definition
What Chunking actually is
Every RAG system makes a decision before any query arrives: how to cut documents into the segments that will be embedded, indexed, and retrieved. The decision looks like plumbing and acts like destiny. An embedding represents its chunk's overall meaning — a chunk spanning five topics embeds as a muddy average no query matches well; a chunk of one sentence matches sharply but arrives stripped of the context that made it true. Retrieval quality, and therefore generation quality, is bounded by what the chunks made findable.
The granularity trade is the core tension. Large chunks preserve context but dilute embeddings and waste prompt budget on irrelevant surroundings; small chunks sharpen matching but fragment meaning, as when a retrieved sentence's critical caveat lived in the paragraph above it. Overlap between adjacent chunks softens boundary damage, and the working optimum for most corpora sits in the few-hundred-token range, but the honest answer is empirical and corpus-specific.
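The size-and-overlap trade described above can be made concrete with a sliding window. This is a minimal sketch, not a reference implementation: `chunk_tokens` is a hypothetical helper, the `size=400` and `overlap=60` values are illustrative picks from the ranges quoted earlier, and the placeholder token list stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, size=400, overlap=60):
    """Slide a fixed-size window over tokens; adjacent chunks share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens (an assumption); a real pipeline would tokenize actual text.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, size=400, overlap=60)
print([len(c) for c in chunks])
```

With these settings each chunk repeats the last 60 tokens of its predecessor (15% overlap), so an idea that straddles a split point appears whole in at least one chunk as long as it is shorter than the overlap.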
Strategy has converged on respecting structure. Documents carry their own segmentation — sections, headings, paragraphs, list items, table boundaries — and splitting along it consistently beats fixed-size cutting, because human-authored boundaries already wrap coherent units of meaning. Semantic chunking goes further, detecting topic shifts by embedding similarity; hierarchical schemes index small chunks for matching while returning their parent sections for context; metadata (source, section, headings) rides along to preserve provenance. Real-world parsing — PDFs, tables, slide decks — is where these pipelines earn their keep or quietly fail.
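A structure-respecting splitter with metadata attached might look like the sketch below. It assumes Markdown-style `#` headings and blank-line paragraph breaks, which real corpora may not have; `split_by_structure` and the sample document are hypothetical names and data used only for illustration.

```python
import re

HEADING = re.compile(r"^(#+)\s+(.*)$")


def split_by_structure(text, source="unknown"):
    """Emit one chunk per paragraph, tagged with the nearest heading above it."""
    chunks, heading, buffer = [], None, []

    def flush():
        # Close the paragraph collected so far, if any.
        if buffer:
            chunks.append({"text": " ".join(buffer), "heading": heading, "source": source})
            buffer.clear()

    for line in text.splitlines():
        line = line.strip()
        match = HEADING.match(line)
        if match:
            flush()                   # a heading ends the previous paragraph
            heading = match.group(2)
        elif not line:
            flush()                   # a blank line ends a paragraph
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Refunds\nRefunds take 5 days.\n\nContact support first.\n# Shipping\nOrders ship Monday."
for chunk in split_by_structure(doc, source="policy.md"):
    print(chunk)
```

Each chunk carries its heading and source, so a retrieved fragment can be cited and reassembled with its section later.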
The operational guidance is unglamorous: chunking is a tunable system parameter, not a set-and-forget default. Retrieval evaluation against representative queries (testing sizes, boundaries, and overlap) can yield double-digit retrieval gains over framework defaults, which makes it among the highest-ROI optimizations in a RAG stack. When a RAG system underperforms, audit the chunks before blaming the model: read what retrieval actually returned, and the fragmentation often explains the hallucination.
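The evaluation loop can be sketched as a hit-rate comparison between two chunking schemes. Everything here is a toy under stated assumptions: word overlap stands in for embedding similarity, the two-sentence corpus and labeled queries are invented for illustration, and `retrieve` and `hit_rate` are hypothetical helpers rather than any framework's API.

```python
def words(text):
    return set(text.lower().split())


def retrieve(query, chunks, k=1):
    """Rank chunks by shared-word count with the query (toy stand-in for vector search)."""
    q = words(query)
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:k]


def hit_rate(chunks, labeled_queries, k=1):
    """Fraction of queries whose top-k chunks contain the expected answer text."""
    hits = sum(
        any(answer in chunk for chunk in retrieve(query, chunks, k))
        for query, answer in labeled_queries
    )
    return hits / len(labeled_queries)


text = "Refunds are issued within five days. Shipping is free over fifty dollars."
labeled = [("when are refunds issued", "within five days"),
           ("is shipping free", "free over fifty")]

# Scheme A: fixed four-word windows. Scheme B: one chunk per sentence.
tokens = text.split()
fixed = [" ".join(tokens[i:i + 4]) for i in range(0, len(tokens), 4)]
by_sentence = [s.strip() + "." for s in text.split(".") if s.strip()]

print("fixed:", hit_rate(fixed, labeled), "sentence:", hit_rate(by_sentence, labeled))
```

On this toy input the fixed windows cut both answers across chunk boundaries, while sentence-aligned chunks keep them intact. A real evaluation would swap in the production retriever and a representative query set, but the comparison has the same shape.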
// how it works
Cutting documents without cutting meaning
Chunking decides what retrieval can find and what generation gets to read — a preprocessing step with system-wide consequences.
Document Parsing
Source formats — PDFs, wikis, slides, tables — convert to clean text with structure preserved. Garbage here propagates everywhere.
Boundary Selection
Split points are chosen — by structure, semantics, or size — deciding what units of meaning will exist for retrieval.
Overlap & Sizing
Chunk dimensions and boundary overlap are set — the granularity trade tuned to corpus and query patterns.
Metadata Attachment
Source, section, and heading context rides with each chunk — provenance for citation, hierarchy for assembly.
Embedding & Indexing
Chunks vectorize and land in the index — the segmentation decisions now baked into everything retrievable.
Retrieval Evaluation
Representative queries score the scheme — the feedback loop that turns a default into a tuned parameter.
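For the boundary-selection step above, a semantic split can be sketched as breaking wherever adjacent sentences stop resembling each other. This is a simplified illustration: the bag-of-words `embed` function is a stand-in for a real embedding model, and the `threshold=0.15` cutoff is an arbitrary assumption that would need tuning per corpus.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a real pipeline would call an embedding model here."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences, threshold=0.15):
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "Refunds are issued within five days.",
    "Refunds require the original receipt.",
    "Shipping is free over fifty dollars.",
    "Express shipping costs ten dollars.",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

Here the two refund sentences stay together and the split falls where the subject changes to shipping, which is the behavior the step describes when headings are absent or unreliable.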
// anatomy
The components teams must understand
01
Chunk Size
The granularity dial
Tokens per segment: trading embedding sharpness against contextual completeness, with an optimum that must be found empirically for each corpus.
02
Overlap
Boundary insurance
Shared content between adjacent chunks — protecting ideas that would otherwise be severed at split points.
03
Structural Splitting
Respecting the author
Headings, paragraphs, and sections as boundaries — human-authored coherence reused as retrieval coherence.
04
Semantic Splitting
Topic-shift detection
Embedding-similarity breaks where subjects change — boundaries by meaning where structure is absent or unreliable.
05
Hierarchical Schemes
Match small, return large
Fine chunks for sharp retrieval, parent sections for complete context — the both-ways answer to the granularity trade.
06
Parsing Layer
The unglamorous gate
PDF, table, and layout extraction quality — where enterprise chunking pipelines actually succeed or silently fail.
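The "match small, return large" idea from the hierarchical component above can be sketched in a few lines. This assumes word-window children and word-overlap matching as stand-ins for real token chunks and vector search; `build_index`, `retrieve_parent`, and the sample sections are hypothetical.

```python
def build_index(sections, child_size=6):
    """Cut each parent section into small word-window children that remember their parent."""
    children = []
    for parent_id, section in enumerate(sections):
        tokens = section.split()
        for start in range(0, len(tokens), child_size):
            children.append({
                "text": " ".join(tokens[start:start + child_size]),
                "parent": parent_id,
            })
    return children


def retrieve_parent(query, sections, children):
    """Match the query against small children, then return the full parent section."""
    q = set(query.lower().split())
    best = max(children, key=lambda c: len(q & set(c["text"].lower().split())))
    return sections[best["parent"]]


sections = [
    "Refunds are issued within five days. A receipt is required for every refund.",
    "Shipping is free over fifty dollars. Express shipping costs ten dollars.",
]
children = build_index(sections, child_size=6)
print(retrieve_parent("express shipping costs", sections, children))
```

The query matches a five-word child sharply, but the generator receives the whole shipping section, including the free-shipping threshold that the small fragment alone would have dropped.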
// strategic implications
What this changes for the business
01 · Leverage
The cheapest big lever in RAG
Tuning chunking can move retrieval quality by double digits over framework defaults with no model change and no new infrastructure, just evaluation and adjustment. Before upgrading embeddings or models, audit and tune the segmentation; it is often the highest-ROI step teams skip.
02 · Debugging
Read the chunks before blaming the model
When RAG hallucinates or misses, the retrieved fragments usually tell the story — context severed at boundaries, topics blurred in oversized chunks, tables mangled in parsing. Chunk audits convert mysterious generation failures into fixable preprocessing defects.
03 · Content
Document hygiene becomes retrieval performance
Well-structured source documents — clear headings, coherent sections, clean formatting — chunk well automatically; sprawling unstructured ones fight every scheme. Authoring standards for knowledge bases now have a measurable AI payoff worth governing toward.
// common misconceptions
What Chunking is not
Myth
“Chunking is a default setting, not a decision.”
Reality
Framework defaults are calibrated to nobody's corpus, and segmentation caps retrieval quality system-wide. Teams that evaluate and tune chunking outperform otherwise identical stacks that don't, consistently and measurably.
Myth
“Bigger chunks mean more context, so bigger is safer.”
Reality
Oversized chunks embed as muddy averages that match queries poorly, and they spend prompt budget on irrelevant surroundings. Context per chunk and findability trade off — the optimum is balanced, not maximal.
Myth
“Long-context models make chunking obsolete.”
Reality
Long context changes how much retrieved material fits, not whether retrieval needs sharp segments to find the right material first. Retrieval precision and per-token cost keep chunking decisions relevant at every window size.