// term 41 · Model Efficiency

Mixture of Experts

Specialized Sub-Model Routing

An architecture that replaces monolithic layers with banks of specialist sub-networks — “experts” — and a learned router that sends each token to only a few of them. MoE scales stored capability toward trillions of parameters while per-token compute stays at mid-size levels, the economics behind several frontier systems.

MoE · Routing · Conditional Compute · Frontier Scale

// Active

2 of 8–64

Experts typically engaged per token — a small learned committee selected from a large bench of specialists.

// Leverage

5–10x

More stored parameters per unit of inference compute versus dense models, with capability held cheap until needed. The ratio depends on expert count and top-k: a 47B-total, 13B-active design sits nearer 3.6x.

// Adoption

frontier

Multiple flagship systems — including Mixtral-class open models and reported frontier architectures — are MoE designs.

// full definition

What Mixture of Experts actually is

Mixture of Experts answers a scaling dilemma: capability grows with parameters, but dense models pay for every parameter on every token. MoE restructures the network's feed-forward layers into parallel expert banks — each a full sub-network — fronted by a small learned router. Per token, the router activates only the top few experts; the rest sit idle at near-zero compute cost. The model knows like a giant and runs like a mid-size.

The router is the architecture's hinge. Trained jointly with the experts, it learns to dispatch tokens by their computational needs — and experts, receiving systematically different traffic, differentiate into implicit specializations. Keeping this healthy requires deliberate engineering: load-balancing losses prevent the collapse where a few favored experts absorb all traffic while the rest atrophy, and capacity limits manage the uneven token batches that routing inevitably creates.
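The capacity limits mentioned above can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: the function names, the `capacity_factor` default of 1.25, and the first-come-first-served overflow policy are all assumptions made for the example, and it handles top-1 assignments only.

```python
import math


def expert_capacity(n_tokens, n_experts, top_k=1, capacity_factor=1.25):
    """Maximum tokens one expert accepts per batch (hypothetical helper)."""
    return math.ceil(capacity_factor * n_tokens * top_k / n_experts)


def apply_capacity(assignments, n_experts, capacity):
    """Split tokens into those each expert keeps and those that overflow.

    assignments: expert index chosen for each token (top-1 routing).
    """
    kept = [[] for _ in range(n_experts)]
    overflow = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(kept[expert_idx]) < capacity:
            kept[expert_idx].append(token_idx)
        else:
            # Expert is full: this token cannot be processed by it.
            overflow.append(token_idx)
    return kept, overflow


# Eight tokens, four experts, expert 0 is over-subscribed.
capacity = expert_capacity(n_tokens=8, n_experts=4, capacity_factor=1.0)
kept, overflow = apply_capacity([0, 0, 0, 1, 2, 3, 0, 1], n_experts=4, capacity=capacity)
```

What happens to overflow tokens varies by system (dropping them, passing them through the residual connection, or rerouting them); the text above does not say which policy applies.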

The costs are memory and systems complexity. Every expert stays resident in accelerator memory regardless of activation — an MoE's serving footprint tracks total parameters even as its speed tracks active ones. At scale, experts shard across devices and tokens physically travel between them, making communication overhead and distributed orchestration first-class concerns. MoE buys its compute economics with serving sophistication — a trade that favors high-volume operators.

For model selection, MoE changed how to read a spec sheet. Headline parameter counts no longer indicate inference cost — a “47B” MoE may run like a 13B dense model while storing 47B worth of capability. Compare models on active parameters for speed and cost, total parameters for memory and capability breadth, and measured task performance above all. The architecture is also why API economics keep improving: frontier capability increasingly ships on mid-size compute budgets.
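The two-number reading of a spec sheet comes down to simple arithmetic. The sketch below uses round, illustrative layer sizes chosen so the totals land near the 47B and 13B figures in the paragraph above. They are assumptions, not published specifications of any model, and `moe_param_counts` is a hypothetical helper.

```python
def moe_param_counts(n_layers, n_experts, top_k, expert_params,
                     shared_params_per_layer, other_params=0):
    """Return (total, active) parameter counts for a simplified MoE model.

    expert_params: parameters in one expert of one layer.
    shared_params_per_layer: attention, norms, and router (always used).
    other_params: embeddings and anything else outside the layers.
    """
    total = n_layers * (n_experts * expert_params + shared_params_per_layer) + other_params
    active = n_layers * (top_k * expert_params + shared_params_per_layer) + other_params
    return total, active


# Illustrative values only (assumed for this example).
total, active = moe_param_counts(
    n_layers=32,
    n_experts=8,
    top_k=2,
    expert_params=176_000_000,
    shared_params_per_layer=42_000_000,
    other_params=262_000_000,
)
print(f"total:  {total / 1e9:.1f}B parameters (drives memory)")
print(f"active: {active / 1e9:.1f}B parameters (drives per-token compute)")
print(f"ratio:  {total / active:.1f}x")
```

Because attention and embedding parameters are always active, the active count is larger than `top_k / n_experts` of the total.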

// how it works

Routing tokens to specialists

Every token takes a different path through an MoE model — the router's choices are where the architecture's economics and its quality both live.

Token Arrival

A token's representation reaches an MoE layer — where, unlike a dense layer, its path is about to be decided, not predetermined.

Router Scoring

The gate network scores every expert for this token — a learned judgment of which specialists fit this input.

Top-K Selection

The highest-scoring few experts are chosen — typically two — defining this token's personal slice of the network.

Expert Computation

Selected experts process the token in parallel; unselected experts spend nothing — conditional compute cashing out.

Weighted Merge

Expert outputs combine, weighted by the router's normalized scores for the selected experts: a single full-quality representation from fractional work.

Balance Maintenance

Auxiliary losses and capacity limits keep traffic spread across the bench — protecting both training stability and serving efficiency.
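The first five steps above can be sketched for a single token in plain Python. This is a toy illustration under stated assumptions: the router is a single linear layer, gate weights are softmax probabilities renormalized over the selected experts (one common variant), and the experts are stand-in functions rather than real feed-forward networks. The name `route_token` is hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_token(token, router_weights, experts, top_k=2):
    """Run one token through a toy MoE layer; return (output, chosen experts)."""
    # Router scoring: one logit per expert (dot product with the token).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    # Top-K selection: indices of the highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the selected gate weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    output = [0.0] * len(token)
    for i in chosen:
        gate = probs[i] / norm
        expert_out = experts[i](token)  # only selected experts are evaluated
        # Weighted merge of expert outputs.
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, chosen


# Four toy experts that scale the input by 1, 2, 3, and 4.
experts = [lambda t, k=k: [k * x for x in t] for k in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
output, chosen = route_token([1.0, 0.0], router_weights, experts, top_k=2)
```

Production implementations batch this across many tokens and group tokens by expert before computing. The per-token loop here is only meant to make the control flow visible.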

// anatomy

The components teams must understand

Expert Bank

The specialist bench

Parallel feed-forward sub-networks holding most of the model's parameters — differentiated by the traffic the router sends them.

Gate Network

The learned dispatcher

The small router scoring experts per token. Its quality determines whether sparsity preserves capability or fragments it.

Top-K Activation

The sparsity dial

How many experts run per token — the parameter that sets the compute-quality operating point of the whole architecture.

Load-Balancing Loss

Anti-collapse pressure

Training terms that spread traffic across experts — without which routing degenerates and most of the bench goes dead.
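One widely used formulation, popularized by the Switch Transformer line of work, multiplies each expert's share of dispatched tokens by its mean router probability, sums over experts, and scales by the number of experts. The text above does not say which loss it has in mind, so treat this top-1 version as an illustrative sketch. `load_balancing_loss` is a hypothetical name, and the small scaling coefficient normally applied to the term is left out.

```python
def load_balancing_loss(router_probs):
    """Auxiliary balance term for top-1 routing.

    router_probs: one list of per-expert probabilities per token.
    Returns 1.0 under perfectly uniform routing and grows as traffic concentrates.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    # f[i]: fraction of tokens whose top-1 choice is expert i.
    f = [0.0] * n_experts
    for probs in router_probs:
        winner = max(range(n_experts), key=lambda i: probs[i])
        f[winner] += 1.0 / n_tokens
    # p[i]: mean router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([[0.6, 0.4], [0.4, 0.6]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
print(balanced, collapsed)
```

In training, the gradient reaches the router through the mean probabilities, since the dispatch fractions come from a non-differentiable argmax.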

Memory Residency

The unsaved cost

All experts occupy accelerator memory regardless of use. MoE economizes compute while paying full price in memory.
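A back-of-envelope sketch of that split, assuming 2 bytes per parameter (16-bit weights) and the common rule of thumb of about 2 forward-pass FLOPs per active parameter per token. Both are approximations introduced here, not figures from the source, and the estimate ignores activations, the KV cache, and attention cost over long contexts.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params):
    """Rough forward-pass cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


# The 47B-total / 13B-active example used elsewhere in this entry.
moe_memory = weight_memory_gb(47e9)            # tracks total parameters
moe_flops = forward_flops_per_token(13e9)      # tracks active parameters
dense_13b_memory = weight_memory_gb(13e9)      # dense model of similar compute

print(f"MoE weights:       {moe_memory:.0f} GB")
print(f"Dense 13B weights: {dense_13b_memory:.0f} GB")
print(f"MoE FLOPs/token:   {moe_flops:.2e}")
```

Under these assumptions the MoE needs several times the weight memory of a dense model with similar per-token compute, which is the cost this component describes.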

Expert Parallelism

Distributed serving

Experts sharded across devices, tokens routed between them — the communication-heavy systems layer MoE serving requires.
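To make the communication cost concrete, the sketch below plans where each token must travel when experts are split evenly across devices. It is a simplified illustration: contiguous expert placement, top-1 routing, and the helper names `plan_dispatch` and `cross_device_fraction` are assumptions for the example, not a description of any serving framework.

```python
from collections import defaultdict


def plan_dispatch(token_devices, token_experts, experts_per_device):
    """Group token indices by (source device, destination device).

    token_devices: device currently holding each token.
    token_experts: expert chosen for each token (top-1).
    Experts are assumed to be placed contiguously: expert e lives on
    device e // experts_per_device.
    """
    traffic = defaultdict(list)
    for token_idx, (src, expert) in enumerate(zip(token_devices, token_experts)):
        dst = expert // experts_per_device
        traffic[(src, dst)].append(token_idx)
    return dict(traffic)


def cross_device_fraction(traffic):
    """Share of tokens that must leave their current device."""
    total = sum(len(tokens) for tokens in traffic.values())
    moved = sum(len(tokens) for (src, dst), tokens in traffic.items() if src != dst)
    return moved / total if total else 0.0


# Four tokens on two devices, four experts with two per device.
traffic = plan_dispatch(
    token_devices=[0, 0, 1, 1],
    token_experts=[0, 3, 1, 2],
    experts_per_device=2,
)
fraction = cross_device_fraction(traffic)
```

Each token that crosses devices also has to return after its expert runs, so the exchange happens twice per MoE layer.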

// strategic implications

What this changes for the business

01 · Economics

Frontier knowledge at mid-size compute

MoE is a structural reason capability-per-dollar keeps improving across the industry — vendors store more knowledge without charging proportionally more compute. The architecture underwrites current API pricing trends and the continued viability of scaling itself.

02 · Evaluation

Spec sheets need two parameter numbers

Total parameters indicate capability breadth and memory footprint; active parameters indicate speed and serving cost. MoE makes the two diverge by design — comparing models across the sparse-dense divide on a single headline number misleads in whichever direction the vendor prefers.

03 · Operations

Self-hosting MoE is a systems commitment

Full memory residency, router behavior, and cross-device token routing make MoE serving meaningfully harder than dense serving. The compute savings are real at scale — and consumed partly by the engineering bench that collects them. Below serious volume, dense models are the simpler buy.

// common misconceptions

What Mixture of Experts is not

Myth

“Experts are deliberate domain specialists — one for law, one for code.”

Reality

Specializations emerge from routing patterns and rarely map to human categories — experts split on token types, syntax, and statistics more than on topics. The design is learned division of labor, not an org chart.

Myth

“A 47B MoE costs what a 47B dense model costs to run.”

Reality

Per-token compute tracks active parameters — often a quarter or less of the total. Memory, however, tracks the full count. MoE splits “size” into two numbers, and both matter to different budgets.

Myth

“MoE is always superior to dense architectures.”

Reality

MoE wins where stored breadth per compute dollar matters and serving scale justifies the complexity. Dense models train more stably, serve more simply, and fine-tune more predictably — the frontier uses both, by workload.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.
