# Chunking — Document Segmentation Process

> Splitting documents into retrieval-sized segments before embedding and indexing — the preprocessing decision that quietly sets the ceiling on RAG quality. Chunk too large and retrieval loses precision; too small and retrieved fragments lose the context that made them meaningful.

**Canonical URL:** https://www.andekian.com/ai-lexicon/chunking  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 59 of 100** · Retrieval & Knowledge  
**Tags:** Segmentation, RAG Quality, Granularity, Context

## Key Stats

- **Typical size — 200–800 tokens:** The working range for most RAG chunks — balanced between embedding focus and contextual completeness.
- **Standard aid — 10–20% overlap:** Adjacent chunks sharing boundary content — insurance against ideas severed mid-thought at split points.
- **Verdict — structure wins:** Splitting on headings, paragraphs, and semantic boundaries consistently outperforms fixed-size cutting in retrieval evaluations.

## What Chunking Actually Is

Every RAG system makes a decision before any query arrives: how to cut documents into the segments that will be embedded, indexed, and retrieved. The decision looks like plumbing and acts like destiny. An embedding represents its chunk's overall meaning — a chunk spanning five topics embeds as a muddy average no query matches well; a chunk of one sentence matches sharply but arrives stripped of the context that made it true. Retrieval quality, and therefore generation quality, is bounded by what the chunks made findable.

The granularity trade is the core tension. Large chunks preserve context but dilute embeddings and waste prompt budget on irrelevant surroundings; small chunks sharpen matching but fragment meaning — the retrieved sentence whose critical caveat lived in the paragraph above it. Overlap between adjacent chunks softens boundary damage, and the working optimum for most corpora sits in the few-hundred-token range — but the honest answer is empirical, and corpus-specific.
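
A minimal sketch of fixed-size chunking with overlap, assuming whitespace-split words stand in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer). The function name `chunk_with_overlap` and its defaults are illustrative only; 400 tokens with 60 of overlap is 15%, inside the ranges listed under Key Stats.

```python
def chunk_with_overlap(tokens, size=400, overlap=60):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
windows = chunk_with_overlap(tokens, size=4, overlap=1)
for window in windows:
    print(window)
```

Each window repeats the last token of the previous one, so a thought that straddles a split point appears whole in at least one chunk when the overlap is wide enough to cover it.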

Strategy has converged on respecting structure. Documents carry their own segmentation — sections, headings, paragraphs, list items, table boundaries — and splitting along it consistently beats fixed-size cutting, because human-authored boundaries already wrap coherent units of meaning. Semantic chunking goes further, detecting topic shifts by embedding similarity; hierarchical schemes index small chunks for matching while returning their parent sections for context; metadata (source, section, headings) rides along to preserve provenance. Real-world parsing — PDFs, tables, slide decks — is where these pipelines earn their keep or quietly fail.
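
Structure-aware splitting with metadata can be sketched as follows, assuming the parsed source is Markdown with `#`-style headings. The `Chunk` class, the `split_by_headings` function, and the `guide.md` filename are hypothetical names for illustration; the sketch also ignores edge cases such as `#` lines inside fenced code.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    text: str
    source: str                                    # provenance for citation
    headings: list = field(default_factory=list)   # heading path, outermost first


def split_by_headings(markdown, source):
    """Emit one chunk per heading-delimited section, tagged with its heading path."""
    chunks, path, buffer = [], [], []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(text=body, source=source, headings=list(path)))
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the section that just ended
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Guide
Intro text.
## Setup
Install it.
## Usage
Run it.
"""
sections = split_by_headings(doc, source="guide.md")
for chunk in sections:
    print(chunk.headings, "->", chunk.text)
```

Carrying the heading path with each chunk keeps provenance available at citation time, and the same path can serve as the parent key in a hierarchical scheme.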

The operational guidance is unglamorous: chunking is a tunable system parameter, not a set-and-forget default. Evaluating retrieval against representative queries while varying chunk size, boundaries, and overlap routinely yields double-digit gains in retrieval metrics over framework defaults, making it among the highest-ROI optimizations in any RAG stack. When a RAG system underperforms, audit the chunks before blaming the model: read what retrieval actually returned, and the fragmentation usually explains the hallucination.
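
One way to run that kind of evaluation is a hit rate at k: the fraction of test queries whose expected answer text appears in the top-k retrieved chunks. The sketch below is a toy under loud assumptions: the word-overlap `retrieve` function stands in for real vector search, and the queries, answers, and chunk texts are invented purely to make the example runnable, not measured results.

```python
import re


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, k):
    """Toy lexical retriever: rank chunks by shared-word count (stand-in for vector search)."""
    q = words(query)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


def hit_rate_at_k(eval_set, chunks, k=3):
    """Fraction of queries whose expected answer text appears in the top-k chunks."""
    hits = 0
    for query, answer in eval_set:
        if any(answer in chunk for chunk in retrieve(query, chunks, k)):
            hits += 1
    return hits / len(eval_set)


# Invented evaluation set: (query, text the right chunk must contain).
eval_set = [
    ("how long is the refund window", "30 days"),
    ("which plan includes audit logs", "Enterprise"),
]

# Two candidate chunking schemes over the same invented source text.
schemes = {
    "per_sentence": [
        "Refunds are accepted within the refund window.",
        "It lasts 30 days.",
        "Audit logs are included in one plan.",
        "That plan is Enterprise.",
    ],
    "per_paragraph": [
        "Refunds are accepted within the refund window. It lasts 30 days.",
        "Audit logs are included in one plan. That plan is Enterprise.",
    ],
}

scores = {name: hit_rate_at_k(eval_set, chunks, k=1) for name, chunks in schemes.items()}
print(scores)
```

In this toy corpus the sentence-level scheme matches the right topic but returns a fragment without the answer, which is the small-chunk failure described above. On a real corpus the comparison could go either way, which is why it has to be measured.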

## How It Works: Cutting documents without cutting meaning

Chunking decides what retrieval can find and what generation gets to read — a preprocessing step with system-wide consequences.

1. **Document Parsing** — Source formats — PDFs, wikis, slides, tables — convert to clean text with structure preserved. Garbage here propagates everywhere.
2. **Boundary Selection** — Split points are chosen — by structure, semantics, or size — deciding what units of meaning will exist for retrieval.
3. **Overlap & Sizing** — Chunk dimensions and boundary overlap are set — the granularity trade tuned to corpus and query patterns.
4. **Metadata Attachment** — Source, section, and heading context rides with each chunk — provenance for citation, hierarchy for assembly.
5. **Embedding & Indexing** — Chunks vectorize and land in the index — the segmentation decisions now baked into everything retrievable.
6. **Retrieval Evaluation** — Representative queries score the scheme — the feedback loop that turns a default into a tuned parameter.
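
For step 2, the semantic option can be sketched as: compare each sentence with the next and start a new chunk where similarity drops. This is a minimal illustration, assuming a bag-of-words `embed` function in place of a real embedding model and an arbitrary `threshold` of 0.2; both would need replacing and tuning in practice.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: word counts. A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_split(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below `threshold` similarity."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            groups.append([cur])      # topic shift: open a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Cats sleep most of the day.",
    "Cats also sleep at night.",
    "Invoices are due in thirty days.",
    "Late invoices are charged a fee.",
]
topics = semantic_split(sentences)
for topic in topics:
    print(topic)
```

The two cat sentences and the two invoice sentences end up in separate chunks because the only adjacent pair with no shared vocabulary is the one that crosses the topic boundary.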

## Anatomy: The Components Teams Must Understand

- **Chunk Size** (The granularity dial): Tokens per segment, trading embedding sharpness against contextual completeness; the optimum is corpus-specific and found empirically.
- **Overlap** (Boundary insurance): Shared content between adjacent chunks — protecting ideas that would otherwise be severed at split points.
- **Structural Splitting** (Respecting the author): Headings, paragraphs, and sections as boundaries — human-authored coherence reused as retrieval coherence.
- **Semantic Splitting** (Topic-shift detection): Embedding-similarity breaks where subjects change — boundaries by meaning where structure is absent or unreliable.
- **Hierarchical Schemes** (Match small, return large): Fine chunks for sharp retrieval, parent sections for complete context — the both-ways answer to the granularity trade.
- **Parsing Layer** (The unglamorous gate): PDF, table, and layout extraction quality — where enterprise chunking pipelines actually succeed or silently fail.
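
The "match small, return large" idea behind hierarchical schemes can be sketched like this, assuming sentences are the small chunks, whole sections are the parents, and word overlap stands in for embedding similarity. The names `build_index` and `retrieve_parent` and the sample sections are hypothetical.

```python
import re


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(sections):
    """Index each sentence as a child chunk that remembers its parent section id."""
    index = []
    for section_id, text in sections.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                index.append((sentence, section_id))
    return index


def retrieve_parent(query, index, sections):
    """Match on the small chunk, then return the whole parent section as context."""
    q = words(query)
    best_sentence, parent_id = max(index, key=lambda item: len(q & words(item[0])))
    return best_sentence, sections[parent_id]


sections = {
    "refunds": "Refunds are issued within 30 days. This applies only to annual plans.",
    "support": "Support is available by email. Replies arrive within one business day.",
}
index = build_index(sections)
matched, context = retrieve_parent("are refunds issued within 30 days", index, sections)
print("matched:", matched)
print("context:", context)
```

The match lands on a single sentence, but the generator receives the full section, including the caveat about annual plans that the matched sentence alone would have dropped.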

## Strategic Implications

- **The cheapest big lever in RAG** (01 · Leverage): Tuning chunking routinely improves retrieval quality by double digits over defaults, with no model change and no new infrastructure, just evaluation and adjustment. Before upgrading embeddings or models, audit and tune the segmentation; it is the highest-ROI step most teams skip.
- **Read the chunks before blaming the model** (02 · Debugging): When RAG hallucinates or misses, the retrieved fragments usually tell the story — context severed at boundaries, topics blurred in oversized chunks, tables mangled in parsing. Chunk audits convert mysterious generation failures into fixable preprocessing defects.
- **Document hygiene becomes retrieval performance** (03 · Content): Well-structured source documents — clear headings, coherent sections, clean formatting — chunk well automatically; sprawling unstructured ones fight every scheme. Authoring standards for knowledge bases now have a measurable AI payoff worth governing toward.

## Common Misconceptions

- **Myth:** “Chunking is a default setting, not a decision.”  
  **Reality:** Framework defaults are calibrated to nobody's corpus, and segmentation caps retrieval quality system-wide. Teams that evaluate and tune chunking outperform teams running identical stacks on untuned defaults, consistently and measurably.
- **Myth:** “Bigger chunks mean more context, so bigger is safer.”  
  **Reality:** Oversized chunks embed as muddy averages that match queries poorly, and they spend prompt budget on irrelevant surroundings. Context per chunk and findability trade off — the optimum is balanced, not maximal.
- **Myth:** “Long-context models make chunking obsolete.”  
  **Reality:** Long context changes how much retrieved material fits, not whether retrieval needs sharp segments to find the right material first. Retrieval precision and prompt cost still matter at every window size, so chunking decisions remain relevant.
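
The "muddy average" in the second myth can be illustrated with a toy calculation. This is a sketch under a strong assumption: bag-of-words counts stand in for a learned embedding, so the numbers only illustrate the dilution effect and say nothing about any particular model. All texts are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Learned embeddings behave differently in detail."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = embed("password reset steps")

focused_text = "Password reset steps: open settings, choose reset password."
# Same content with three unrelated topics appended to the chunk.
bloated_text = (
    focused_text
    + " Invoices are sent monthly. Shipping takes five days."
    + " Offices close on public holidays."
)

focused_score = cosine(query, embed(focused_text))
bloated_score = cosine(query, embed(bloated_text))
print(round(focused_score, 3), round(bloated_score, 3))
```

The relevant words are identical in both chunks; only the unrelated material differs, and it alone pulls the similarity score down.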

## Related Terms

- [Context Window — Operational Memory Limit](https://www.andekian.com/ai-lexicon/context-window)
- [RAG — Retrieval-Augmented Generation](https://www.andekian.com/ai-lexicon/rag)
- [Embeddings — Meaning Encoded As Vectors](https://www.andekian.com/ai-lexicon/embeddings)
- [Vector Database — Stores Vector Embeddings](https://www.andekian.com/ai-lexicon/vector-database)
- [Semantic Search — Meaning-Based Retrieval](https://www.andekian.com/ai-lexicon/semantic-search)
- [Context Injection — Dynamic Information Insertion](https://www.andekian.com/ai-lexicon/context-injection)
- [Context Compression — Smaller Context Footprint](https://www.andekian.com/ai-lexicon/context-compression)
- [Retrieval Pipeline — Information Retrieval Flow](https://www.andekian.com/ai-lexicon/retrieval-pipeline)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon
