// term 59 · Retrieval & Knowledge

Chunking

Document Segmentation Process

Splitting documents into retrieval-sized segments before embedding and indexing — the preprocessing decision that quietly sets the ceiling on RAG quality. Chunk too large and retrieval loses precision; too small and retrieved fragments lose the context that made them meaningful.

Segmentation · RAG Quality · Granularity · Context

// Typical size

200–800 tokens

The working range for most RAG chunks — balanced between embedding focus and contextual completeness.

// Standard aid

10–20% overlap

Adjacent chunks sharing boundary content — insurance against ideas severed mid-thought at split points.

// Verdict

structure wins

Splitting on headings, paragraphs, and semantic boundaries consistently outperforms fixed-size cutting in retrieval evaluations.

// full definition

What Chunking actually is

Every RAG system makes a decision before any query arrives: how to cut documents into the segments that will be embedded, indexed, and retrieved. The decision looks like plumbing and acts like destiny. An embedding represents its chunk's overall meaning — a chunk spanning five topics embeds as a muddy average no query matches well; a chunk of one sentence matches sharply but arrives stripped of the context that made it true. Retrieval quality, and therefore generation quality, is bounded by what the chunks made findable.

The granularity trade is the core tension. Large chunks preserve context but dilute embeddings and waste prompt budget on irrelevant surroundings; small chunks sharpen matching but fragment meaning — the retrieved sentence whose critical caveat lived in the paragraph above it. Overlap between adjacent chunks softens boundary damage, and the working optimum for most corpora sits in the few-hundred-token range — but the honest answer is empirical, and corpus-specific.
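The size-and-overlap trade described above can be sketched as a sliding window. This is a minimal illustration, not a production splitter: the function name `chunk_with_overlap` and its parameters are hypothetical, and a list of placeholder words stands in for real tokenizer output.

```python
def chunk_with_overlap(tokens, chunk_size=400, overlap=60):
    """Split a token list into fixed-size windows that share boundary tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_with_overlap(words, chunk_size=400, overlap=60)
sizes = [len(c) for c in chunks]
```

With 400-token windows and a 60-token overlap (15%), each chunk repeats the last 60 tokens of its predecessor, so a sentence cut at one boundary still appears whole in the neighbouring chunk.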

Strategy has converged on respecting structure. Documents carry their own segmentation — sections, headings, paragraphs, list items, table boundaries — and splitting along it consistently beats fixed-size cutting, because human-authored boundaries already wrap coherent units of meaning. Semantic chunking goes further, detecting topic shifts by embedding similarity; hierarchical schemes index small chunks for matching while returning their parent sections for context; metadata (source, section, headings) rides along to preserve provenance. Real-world parsing — PDFs, tables, slide decks — is where these pipelines earn their keep or quietly fail.
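A rough sketch of structure-respecting splitting with metadata attached follows. It assumes, purely for illustration, Markdown-style `#` headings separated from body text by blank lines; the `Chunk` class and `split_on_structure` function are hypothetical names, and real parsers for PDFs or slides need far more than this.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    heading: str  # nearest heading above the paragraph, kept as metadata
    source: str   # originating document, kept for citation


def split_on_structure(document, source):
    """Split on blank-line paragraph breaks, tracking the current heading."""
    chunks = []
    heading = ""
    for block in re.split(r"\n\s*\n", document.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # Assumes a heading sits alone in its own blank-line-delimited block.
            heading = block.lstrip("#").strip()
            continue
        chunks.append(Chunk(text=block, heading=heading, source=source))
    return chunks


doc = """# Refunds

Refunds are issued within 14 days.

Exceptions apply to digital goods.

# Shipping

Orders ship within two business days."""

chunks = split_on_structure(doc, source="policy.md")
headings = [c.heading for c in chunks]
```

Each paragraph becomes one chunk and carries its heading and source along, so a retrieved fragment can still be cited and placed back in its section.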

The operational guidance is unglamorous: chunking is a tunable system parameter, not a set-and-forget default. Retrieval evaluation against representative queries — testing sizes, boundaries, and overlap — routinely yields double-digit retrieval gains over framework defaults, making it among the highest-ROI optimizations in any RAG stack. When a RAG system underperforms, audit the chunks before blaming the model: read what retrieval actually returned, and the fragmentation usually explains the hallucination.
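The evaluation loop can be sketched as a small harness that scores competing chunking schemes against the same queries. Everything here is a toy assumption: the word-overlap `retrieve` function stands in for a real embedding search, and the three-sentence corpus and two-query evaluation set are invented for illustration.

```python
def retrieve(query, chunks, k):
    """Toy retriever: rank chunks by how many query words they contain."""
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(eval_set, chunks, k=1):
    """Fraction of queries whose expected answer text appears in the top-k chunks."""
    hits = 0
    for query, expected in eval_set:
        if any(expected in chunk for chunk in retrieve(query, chunks, k)):
            hits += 1
    return hits / len(eval_set)


text = ("Refunds are issued within 14 days of purchase. "
        "Orders ship within two business days. "
        "Support is available by email on weekdays.")
words = text.split()

# Two candidate schemes over the same text.
schemes = {
    "fixed_5_words": [" ".join(words[i:i + 5]) for i in range(0, len(words), 5)],
    "per_sentence": [s.strip() + "." for s in text.split(".") if s.strip()],
}

eval_set = [
    ("when are refunds issued", "14 days"),
    ("how fast do orders ship", "two business days"),
]

results = {name: hit_rate(eval_set, chunks, k=1) for name, chunks in schemes.items()}
```

On this toy corpus the fixed five-word windows cut both answers in half, so the best-matching chunk never contains the full answer, while the sentence-aligned scheme retrieves both. Real evaluations need representative queries and a real retriever, but the shape of the loop is the same.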

// how it works

Cutting documents without cutting meaning

Chunking decides what retrieval can find and what generation gets to read — a preprocessing step with system-wide consequences.

Document Parsing

Source formats — PDFs, wikis, slides, tables — convert to clean text with structure preserved. Garbage here propagates everywhere.

Boundary Selection

Split points are chosen — by structure, semantics, or size — deciding what units of meaning will exist for retrieval.

Overlap & Sizing

Chunk dimensions and boundary overlap are set — the granularity trade tuned to corpus and query patterns.

Metadata Attachment

Source, section, and heading context rides with each chunk — provenance for citation, hierarchy for assembly.

Embedding & Indexing

Chunks vectorize and land in the index — the segmentation decisions now baked into everything retrievable.

Retrieval Evaluation

Representative queries score the scheme — the feedback loop that turns a default into a tuned parameter.

// anatomy

The components teams must understand

Chunk Size

The granularity dial

Tokens per segment — trading embedding sharpness against contextual completeness, with an empirically corpus-specific optimum.

Overlap

Boundary insurance

Shared content between adjacent chunks — protecting ideas that would otherwise be severed at split points.

Structural Splitting

Respecting the author

Headings, paragraphs, and sections as boundaries — human-authored coherence reused as retrieval coherence.

Semantic Splitting

Topic-shift detection

Embedding-similarity breaks where subjects change — boundaries by meaning where structure is absent or unreliable.
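A minimal sketch of topic-shift detection: start a new chunk wherever similarity between adjacent sentences drops below a threshold. The bag-of-words `embed` function is a stand-in assumption for a real embedding model, and the `0.2` threshold is an arbitrary illustrative value that would need tuning per corpus.

```python
import math
from collections import Counter


def embed(sentence):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(word.strip(".,").lower() for word in sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_split(sentences, threshold=0.2):
    """Group consecutive sentences, breaking where adjacent similarity is low."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            groups.append([curr])  # topic shift: open a new chunk
        else:
            groups[-1].append(curr)
    return [" ".join(group) for group in groups]


sentences = [
    "The refund policy covers all purchases.",
    "The refund policy excludes digital purchases.",
    "Shipping takes two business days.",
    "Shipping costs depend on two factors.",
]
chunks = semantic_split(sentences)
```

Here the two refund sentences share most of their words and stay together, the jump to shipping shares none and triggers a break, and the two shipping sentences are grouped into a second chunk.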

Hierarchical Schemes

Match small, return large

Fine chunks for sharp retrieval, parent sections for complete context — the both-ways answer to the granularity trade.
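The match-small, return-large idea can be sketched with a child-to-parent mapping. This is an illustrative assumption rather than any library's API: `build_hierarchy` and `retrieve_parent` are hypothetical names, sentences are split naively on ". ", and word overlap again stands in for embedding search.

```python
def build_hierarchy(sections):
    """Index each sentence as a child chunk that points back to its parent section."""
    children = []
    for parent_id, section in enumerate(sections):
        for sentence in section.split(". "):
            sentence = sentence.strip()
            if sentence:
                children.append((sentence, parent_id))
    return children


def retrieve_parent(query, children, sections):
    """Match on the small chunk, then return the large one it belongs to."""
    query_words = set(query.lower().split())

    def score(child):
        text, _ = child
        return len(query_words & set(text.lower().strip(".").split()))

    # Raises ValueError if `children` is empty; a real system would handle that.
    _, parent_id = max(children, key=score)
    return sections[parent_id]


sections = [
    "Refunds are issued within 14 days. Digital goods are excluded from refunds.",
    "Orders ship within two business days. Express shipping costs extra.",
]
children = build_hierarchy(sections)
context = retrieve_parent("are digital goods excluded", children, sections)
```

The query matches the single sentence about digital goods, but the returned context is the whole refunds section, so the 14-day rule that frames the exclusion arrives with it.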

Parsing Layer

The unglamorous gate

PDF, table, and layout extraction quality — where enterprise chunking pipelines actually succeed or silently fail.

// strategic implications

What this changes for the business

01 · Leverage

The cheapest big lever in RAG

Tuning chunking routinely improves retrieval quality by double digits over framework defaults, with no model change and no new infrastructure, just evaluation and adjustment. Before upgrading embeddings or models, audit and tune the segmentation; it is the highest-ROI step most teams skip.

02 · Debugging

Read the chunks before blaming the model

When RAG hallucinates or misses, the retrieved fragments usually tell the story — context severed at boundaries, topics blurred in oversized chunks, tables mangled in parsing. Chunk audits convert mysterious generation failures into fixable preprocessing defects.

03 · Content

Document hygiene becomes retrieval performance

Well-structured source documents — clear headings, coherent sections, clean formatting — chunk well automatically; sprawling unstructured ones fight every scheme. Authoring standards for knowledge bases now have a measurable AI payoff worth governing toward.

// common misconceptions

What Chunking is not

Myth

“Chunking is a default setting, not a decision.”

Reality

Framework defaults are calibrated to nobody's corpus, and segmentation caps retrieval system-wide. Teams that evaluate and tune chunking outperform teams running identical stacks that don't, consistently and measurably.

Myth

“Bigger chunks mean more context, so bigger is safer.”

Reality

Oversized chunks embed as muddy averages that match queries poorly, and they spend prompt budget on irrelevant surroundings. Context per chunk and findability trade off — the optimum is balanced, not maximal.

Myth

“Long-context models make chunking obsolete.”

Reality

Long context changes how much retrieved material fits, not whether retrieval needs sharp segments to find the right material first. Precision economics keep chunking decisions alive at every window size.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.
