// term 07 · Model Efficiency

SLMs & Distillation

Small Language Models & Knowledge Distillation

Compressing frontier-model capability into compact architectures — typically 1B–13B parameters — by training small “student” models on the outputs and behavior of large “teachers.” The result: near-frontier performance on scoped tasks at a fraction of the cost, latency, and footprint.

Distillation · Edge AI · Latency · Unit Economics

// Scale

1B–13B

Typical SLM parameter range — small enough for a single GPU, a laptop, or increasingly a phone.

// Economics

10–30x

Inference cost reduction when a distilled specialist replaces a frontier API on high-volume, scoped tasks.

// Retention

90%+

Of teacher performance achievable on narrow domains. The gap concentrates in open-ended reasoning, not routine execution.

// full definition

What SLMs & Distillation actually is

Knowledge distillation inverts the usual training economics: instead of learning from raw internet text, a small student model learns from the polished outputs of a frontier teacher. The teacher generates millions of high-quality answers, labels, and reasoning traces across a defined task distribution; the student trains to reproduce them. Capability that cost hundreds of millions to discover transfers for a fraction of that cost.

The classic technique trains the student on the teacher's full output distribution (soft targets) rather than just final answers, transferring the teacher's relative confidence across alternatives along with its behavior. This requires access to the teacher's logits; when only text outputs are available, the student trains on the generated answers instead. Stacked with quantization and pruning, the result is a model that runs where frontier models cannot: on a single GPU, at the edge, inside a browser, or fully on-device where data residency and latency demands rule out API calls.
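To make soft targets concrete, here is a minimal sketch of a soft-target loss in plain Python. It assumes both models expose raw logits over the same classes. The function names, the example logits, and the temperature of 2.0 are illustrative choices, not values from any specific system. Scaling by the squared temperature is a common convention, not a requirement.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplied by temperature**2 (a common convention) so the loss scale
    stays comparable when the temperature changes.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for a three-class task.
teacher = [4.0, 1.5, 0.2]
aligned_student = [3.9, 1.6, 0.1]   # close to the teacher's distribution
confused_student = [0.2, 1.5, 4.0]  # ranks the classes in reverse

print(soft_target_loss(teacher, aligned_student))
print(soft_target_loss(teacher, confused_student))
```

The aligned student produces a loss near zero and the confused student a much larger one. In practice this term is usually combined with an ordinary cross-entropy loss on ground-truth labels where those exist.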

The economics are the headline. High-volume, well-scoped tasks — classification, extraction, routing, templated drafting, summarization at scale — rarely need frontier generality, and paying frontier prices for them is the most common overspend in enterprise AI. A distilled specialist serving those workloads converts variable API spend into fixed, predictable serving cost, with payback periods measured in weeks at volume.
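The payback claim reduces to simple arithmetic. The sketch below is a back-of-envelope calculator; every number in the example (request volume, per-request prices, build cost, weekly serving cost) is a hypothetical placeholder to replace with your own figures.

```python
def payback_weeks(requests_per_week, api_cost_per_request,
                  student_cost_per_request, one_time_build_cost,
                  fixed_serving_cost_per_week):
    """Weeks until cumulative savings cover the one-time distillation cost.

    Returns None when the student never pays back at this volume.
    """
    weekly_api_spend = requests_per_week * api_cost_per_request
    weekly_student_spend = (requests_per_week * student_cost_per_request
                            + fixed_serving_cost_per_week)
    weekly_savings = weekly_api_spend - weekly_student_spend
    if weekly_savings <= 0:
        return None
    return one_time_build_cost / weekly_savings

# Hypothetical high-volume workload: 5M requests per week.
high_volume = payback_weeks(
    requests_per_week=5_000_000,
    api_cost_per_request=0.004,
    student_cost_per_request=0.0002,
    one_time_build_cost=60_000,
    fixed_serving_cost_per_week=2_000,
)

# Same prices at 50k requests per week: fixed serving cost outweighs savings.
low_volume = payback_weeks(50_000, 0.004, 0.0002, 60_000, 2_000)

print(high_volume, low_volume)
```

With these placeholder figures the high-volume case pays back in roughly 3.5 weeks, while the low-volume case never does, which is why volume is the first thing to check.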

The trade is generality. Students inherit task behavior, not breadth — outside the distilled distribution, performance falls off sharply. Production architectures handle this with routing: easy, high-volume traffic flows to the student; ambiguous or novel cases escalate to the teacher. The portfolio — not any single model — is the unit of design.
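One simple way to implement the escalation described above is a confidence threshold. The sketch below assumes the student returns a confidence score alongside its answer; `stub_student`, `stub_teacher`, the labels, and the 0.8 threshold are all hypothetical stand-ins for real model calls and a tuned cutoff.

```python
from dataclasses import dataclass

@dataclass
class RoutedAnswer:
    text: str
    served_by: str  # "student" or "teacher"

def route(prompt, student, teacher, min_confidence=0.8):
    """Serve from the student when it is confident; otherwise escalate."""
    answer, confidence = student(prompt)
    if confidence >= min_confidence:
        return RoutedAnswer(answer, "student")
    return RoutedAnswer(teacher(prompt), "teacher")

def stub_student(prompt):
    """Hypothetical student: confident only on phrases it was distilled on."""
    known = {"reset password": "account_access", "refund": "billing"}
    for phrase, label in known.items():
        if phrase in prompt.lower():
            return label, 0.95
    return "unknown", 0.30

def stub_teacher(prompt):
    """Hypothetical stand-in for a frontier model call."""
    return "needs_specialist_review"

easy = route("How do I reset password?", stub_student, stub_teacher)
hard = route("My invoice shows a currency I never selected", stub_student, stub_teacher)
print(easy, hard)
```

Real routers often use richer signals than a single score, such as a separate classifier or agreement between samples, but the control flow stays the same.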

// how it works

Compressing a giant into a specialist

Distillation transfers behavior from teacher to student through data. The pipeline is straightforward; the leverage is in task scoping.

Task Scoping

Define the narrow workload the student must master. Distillation succeeds on bounded tasks and degrades on open-ended generality, so scoping is the central design decision.

Teacher Generation

The frontier model produces high-quality outputs — answers, labels, reasoning traces — across the task distribution, becoming the student's training corpus.
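As a rough illustration of this step, the sketch below turns a list of task prompts into prompt/completion records in JSONL form, skipping duplicate prompts and empty teacher outputs. The `teacher` callable is a hypothetical stub; a real pipeline would call a model API and apply stronger quality filters.

```python
import json

def build_corpus(prompts, teacher, min_chars=1):
    """Return JSONL lines of prompt/completion pairs for student training."""
    seen = set()
    lines = []
    for prompt in prompts:
        key = prompt.strip().lower()
        if key in seen:  # skip duplicate prompts
            continue
        seen.add(key)
        completion = teacher(prompt).strip()
        if len(completion) < min_chars:  # drop empty teacher outputs
            continue
        record = {"prompt": prompt.strip(), "completion": completion}
        lines.append(json.dumps(record))
    return lines

def stub_teacher(prompt):
    """Hypothetical teacher: labels support tickets, returns '' when unsure."""
    text = prompt.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "account_access"
    return ""

corpus = build_corpus(
    ["I want a refund", "i want a refund ", "Forgot my password", "???"],
    stub_teacher,
)
for line in corpus:
    print(line)
```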

Student Training

The small model trains to reproduce teacher outputs, often matching full probability distributions (soft targets) rather than single answers.

Compression Stack

Quantization and pruning layer on top of distillation, shrinking the student further to fit target hardware budgets.
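To show what quantization does to a weight tensor, here is a minimal sketch of symmetric per-tensor INT8 quantization on a plain list of floats. It is one simple scheme among several; production toolchains typically quantize per channel or per group and calibrate on real data. The example weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map stored integers back to approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.52, -1.27, 0.003, 0.9]  # hypothetical float weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is bounded by half the scale. Whether that error matters for the task is an empirical question for the evaluation step.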

Gap Evaluation

Student and teacher are scored head-to-head on the task eval. The decision is empirical: is the remaining quality gap worth the cost and latency savings?
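A head-to-head comparison can be as simple as the sketch below, which scores both models on the same labeled eval set and reports how much of the teacher's accuracy the student retains. The labels, predictions, and the 0.9 retention threshold are hypothetical; exact-match accuracy is assumed as the metric, and other tasks will need other metrics.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

def gap_report(student_preds, teacher_preds, gold, min_retention=0.9):
    """Compare student and teacher on one eval set and suggest a decision."""
    student_acc = accuracy(student_preds, gold)
    teacher_acc = accuracy(teacher_preds, gold)
    retention = student_acc / teacher_acc if teacher_acc > 0 else 0.0
    return {
        "student_acc": student_acc,
        "teacher_acc": teacher_acc,
        "retention": retention,
        "ship": retention >= min_retention,
    }

# Hypothetical 10-example eval set.
gold = ["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"]
teacher_preds = gold[:9] + ["a"]       # one mistake
student_preds = gold[:8] + ["c", "a"]  # two mistakes

report = gap_report(student_preds, teacher_preds, gold)
print(report)
```

Here the student retains about 89% of teacher accuracy and falls just short of the example threshold. The threshold itself is a business decision that weighs the quality gap against cost and latency savings.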

Deployment & Routing

The student ships to its runtime — server fleet, browser, device — with a router escalating hard cases back to the teacher.

// anatomy

The components teams must understand

Teacher Model

The capability source

A frontier model whose behavior defines the target. Teacher quality caps student quality, because distillation transfers capability rather than creating it.

Student Architecture

The compact vessel

A small transformer sized for the deployment target. Architecture choice trades capability ceiling against memory and latency budgets.

Soft Targets

Richer training signal

Training on the teacher's full probability distribution transfers calibrated uncertainty, not just answers — the classic distillation advantage.

Synthetic Corpus

Teacher-generated data

Millions of teacher interactions covering the task space. Corpus diversity determines how well the student generalizes within scope.

Quantization Layer

Precision dieting

INT8/INT4 weight compression stacked on distillation. It multiplies the footprint savings, typically with only a small additional quality loss that the task eval should confirm.

Fallback Router

Escalation path

Production systems route easy traffic to the student and escalate ambiguity to the teacher — capturing savings without capping quality.

// strategic implications

What this changes for the business

01 · Economics

High-volume AI runs on small models

Paying frontier prices for routine traffic is the most common overspend in enterprise AI. Distilled specialists turn variable API spend into fixed, predictable serving costs — and at meaningful volume, the payback period is measured in weeks. Audit your traffic mix before your next API contract renewal.

02 · Privacy

On-device AI changes the data equation

SLMs run where the data lives — devices, branch infrastructure, air-gapped environments. Workloads previously blocked on data-residency, sovereignty, or latency grounds become feasible when the model travels to the data instead of the reverse.

03 · Strategy

Think portfolio, not flagship

Mature AI stacks are tiered: distilled specialists on the high-volume floor, mid-size generalists in the middle, frontier calls reserved for the hard tail. Model routing is becoming a core architecture competency — the organizations that master it get frontier quality at commodity blended cost.
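The blended-cost idea is a weighted average across tiers. The sketch below works through one hypothetical traffic mix; the shares and per-request prices are placeholders, not benchmarks.

```python
def blended_cost_per_request(tiers):
    """tiers: list of (traffic_share, cost_per_request); shares must sum to 1."""
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in tiers)

# Hypothetical three-tier portfolio.
tiers = [
    (0.80, 0.0002),  # distilled specialists
    (0.15, 0.001),   # mid-size generalist
    (0.05, 0.004),   # frontier model for the hard tail
]
frontier_only_cost = 0.004

blended = blended_cost_per_request(tiers)
savings_multiple = frontier_only_cost / blended

print(f"{blended:.5f} per request, {savings_multiple:.1f}x cheaper than frontier-only")
```

With this mix the blended cost is about 0.00051 per request, close to 8x below sending everything to the frontier tier. The result is most sensitive to the share of traffic that still needs the frontier model.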

// common misconceptions

What SLMs & Distillation is not

Myth

“Small models are just worse big models.”

Reality

On scoped tasks, distilled SLMs routinely retain 90%+ of teacher performance while winning decisively on cost, latency, and deployability. What you sacrifice is generality, not task quality. The question is whether your workload needs breadth or excellence at one thing.

Myth

“Distillation clones the teacher.”

Reality

Students inherit task behavior, not breadth. Outside the distilled distribution, performance falls off sharply — which is why task scoping and fallback routing are part of the design, not afterthoughts.

Myth

“One frontier model is simpler, so it is cheaper.”

Reality

Operational simplicity at 10–30x the unit cost rarely survives volume growth. The routing layer pays for itself quickly — and preserves frontier budget for the problems that genuinely need frontier capability.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.
