// term 13 · Production & Operations

Inference

Runtime AI Execution

Running a trained model on new inputs to produce outputs: the production phase of AI. A model is trained once, or at most occasionally; inference runs on every request, potentially billions of times. Latency, throughput, and unit cost are all inference properties, and for a deployed model most of the compute spend accrues here.

Serving · Latency · Throughput · Unit Cost

// Spend share

80–90%

Of lifetime compute for a deployed model typically lands in inference, not training. The recurring bill dwarfs the one-time one.

// Phases

Prefill processes the prompt in parallel; decode generates output one token at a time. Each phase has different bottlenecks and different optimizations.

// Leverage

2–10x

Throughput gains routinely available from batching, caching, and quantization on identical hardware — serving engineering pays for itself.

// full definition

What Inference actually is

Training gets the headlines; inference pays the bills. Once a model is trained, every chatbot reply, document summary, and agent action is an inference operation — a forward pass through billions of parameters on accelerator hardware. For any successful deployment, cumulative inference compute rapidly dwarfs what training cost, which makes inference efficiency the dominant economic variable in production AI.

A request lives in two phases with opposite characters. Prefill ingests the entire prompt in parallel: compute-intensive, but fast per token. Decode then generates output serially, one token at a time. Each step is a full pass over the model that rereads the weights and the growing attention state (the KV cache) from GPU memory, so decode is usually limited by memory bandwidth rather than raw compute. This is why long prompts are cheap relative to long outputs, why latency tracks response length, and why streaming exists to mask the serial grind.
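The two-phase split can be made concrete with a back-of-envelope latency model. This is a minimal sketch, not a benchmark: the function name and the throughput figures (5,000 prompt tokens per second for prefill, 50 tokens per second for decode) are illustrative assumptions, and real engines add queueing and scheduling overhead on top.

```python
def request_latency_s(prompt_tokens, output_tokens,
                      prefill_tokens_per_s, decode_tokens_per_s):
    """Estimate (time_to_first_token, total_latency) in seconds.

    Simplification: prefill time is treated as time-to-first-token, and
    every output token is charged one serial decode step.
    """
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("throughput rates must be positive")
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    decode_time = output_tokens / decode_tokens_per_s
    return time_to_first_token, time_to_first_token + decode_time


# Hypothetical rates, chosen only to show the asymmetry between phases.
ttft, total = request_latency_s(
    prompt_tokens=2000, output_tokens=500,
    prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0,
)
print(f"time to first token: {ttft:.2f}s, total: {total:.2f}s")
```

Under these assumed rates, a 2,000-token prompt costs 0.4 seconds while a 500-token answer costs 10 seconds, which is the sense in which long prompts are cheap relative to long outputs.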

Between the model and the hardware sits the inference engine — vLLM, TensorRT-LLM, and peers — where most production efficiency is won. Continuous batching interleaves many users' requests to keep GPUs saturated; paged KV-cache management stretches memory; quantized weights shrink the model's footprint; speculative decoding drafts tokens with a small model and verifies with the large one. The same model on the same hardware can vary several-fold in throughput depending on the serving stack.
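Of the techniques above, speculative decoding is the least intuitive, so a toy sketch may help. It uses the greedy variant, where a drafted token is accepted only if it matches the large model's own choice. The two "models" are hypothetical stand-in functions over integer token IDs, not real networks, and a real engine performs the verification as one batched forward pass rather than a Python loop.

```python
def target_model(ctx):
    """Toy stand-in for the large model: deterministic next-token rule."""
    return (ctx[-1] * 3 + 1) % 7


def draft_model(ctx):
    """Toy stand-in for the small model: agrees with the target except after token 5."""
    return 0 if ctx[-1] == 5 else target_model(ctx)


def greedy_decode(model, prompt, max_new):
    """Baseline: one model pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target, draft, prompt, max_new, k=4):
    """Greedy speculative decoding; returns (new_tokens, target_passes)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model cheaply proposes k tokens, one after another.
        proposal, ctx = [], list(out)
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks every proposed position (counted as one pass).
        target_passes += 1
        ctx = list(out)
        for token in proposal:
            expected = target(ctx)
            ctx.append(expected)      # always keep the target's own token
            if expected != token:
                break                 # first mismatch: drop the rest of the draft
        else:
            ctx.append(target(ctx))   # whole draft accepted: one bonus token
        out = ctx[:goal]
    return out[len(prompt):], target_passes


tokens, passes = speculative_decode(target_model, draft_model, [1], max_new=12)
print(tokens, passes)
```

The output is identical to what the target model would produce alone; the gain is that each expensive pass yields between one and k + 1 tokens instead of exactly one.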

For decision-makers, inference is where AI strategy meets the income statement. Unit costs determine which use cases clear ROI hurdles; latency budgets determine which products feel responsive; capacity planning determines whether launches survive their own success. Vendor pricing, model right-sizing, and serving architecture deserve the same scrutiny as any other production infrastructure with a nine-figure trajectory.

// how it works

The life of a single request

Every model call traverses the same pipeline — and each stage carries a distinct lever for cost and latency.

Request Assembly

The application composes the prompt — system instructions, history, retrieved context — and dispatches it to the serving endpoint.

Tokenization

Text becomes token IDs. Input length is now fixed — and with it, the prefill compute bill for this request.

Prefill

The full prompt is processed in parallel, building the KV cache — the attention state the decode phase will read from.

Decode Loop

Output is generated one token per step, each requiring a full model pass. This serial loop is the dominant latency bottleneck of autoregressive generation.

Streaming & Post-Processing

Tokens stream to the user as generated, masking decode latency; the application parses, validates, and formats the final output.

Metering & Monitoring

Tokens are counted and billed; latency, error rates, and quality signals flow to observability — the feedback loop of production operation.

// anatomy

The components teams must understand

Inference Engine

The serving runtime

Software like vLLM or TensorRT-LLM that schedules requests and executes the model. The largest efficiency lever after model choice itself.

KV Cache

Attention state in memory

Per-request attention keys and values held in GPU memory throughout generation. KV-cache memory grows with context length × concurrency, on top of the fixed footprint of the model weights.
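The memory arithmetic can be sketched directly. The formula below is the standard accounting for a transformer that stores one key and one value vector per layer, per KV head, per token; the model dimensions in the example are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, concurrent_requests, bytes_per_value=2):
    """Approximate KV-cache size in bytes (bytes_per_value=2 assumes fp16/bf16)."""
    # Factor of 2: one key tensor plus one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_requests


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       context_tokens=8192, concurrent_requests=16)
print(f"{total / 2**30:.1f} GiB")
```

With these assumed dimensions each token costs 128 KiB of cache, so 16 concurrent requests at 8,192 tokens each need 16 GiB before a single weight is loaded.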

Batching Scheduler

GPU saturation

Continuous batching interleaves many requests through the hardware simultaneously — the difference between idle silicon and full utilization.
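A toy simulation shows where the gain comes from. It counts decode steps only and ignores prefill, memory limits, and arrival times, so it is a sketch of the scheduling idea rather than of any particular engine; the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(output_lengths)   # tokens still to generate per queued request
    running = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1                    # one decode step for every running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long answers mixed with short ones (output lengths in tokens).
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

In the static case three of four slots sit idle for 90 steps per batch while the long request finishes; refilling slots as they free up completes the same work in roughly half the steps.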

Quantized Weights

Smaller, faster math

Reduced-precision parameters cut memory footprint and bandwidth, raising throughput with modest quality cost — standard practice in production serving.
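As a minimal sketch of the idea, the snippet below applies symmetric per-tensor int8 quantization to a handful of weights and computes the footprint of a hypothetical 7-billion-parameter model at different precisions. Production schemes are more elaborate (per-channel or per-group scales, calibration data), so treat this as an illustration of the principle only.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (int8 values, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [q * scale for q in quantized]


def weight_bytes(num_params, bits_per_param):
    """Memory needed to hold the weights at a given precision."""
    return num_params * bits_per_param // 8


weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 3, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error

# Hypothetical 7B-parameter model at 16, 8, and 4 bits per weight.
for bits in (16, 8, 4):
    print(bits, weight_bytes(7_000_000_000, bits) / 1e9, "GB")
```

The rounding error per weight is bounded by half the scale, which is the "modest quality cost"; the footprint falls from 14 GB at 16 bits to 3.5 GB at 4 bits, and decode speed tends to improve with it because fewer bytes cross the memory bus per token.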

Accelerator Fleet

The physical layer

GPUs or custom silicon executing the math. Procurement, utilization, and placement (cloud, on-prem, edge) set the cost floor.

Observability Hooks

Production telemetry

Latency percentiles, token throughput, error rates, and quality sampling — the instrumentation that turns serving from a black box into an operation.
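Percentiles matter because averages hide the slow tail. The sketch below uses the nearest-rank definition of a percentile on a made-up sample of request latencies; the numbers are illustrative, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, including two slow outliers.
latencies_ms = [120, 135, 150, 140, 900, 130, 125, 145, 160, 2100]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean {mean} ms")                       # 410.5
print(f"p50 {percentile(latencies_ms, 50)} ms")  # 140
print(f"p90 {percentile(latencies_ms, 90)} ms")  # 900
print(f"p99 {percentile(latencies_ms, 99)} ms")  # 2100
```

The mean of 410.5 ms describes no actual request: the typical one takes 140 ms and the worst takes over two seconds, which is why latency targets are usually written against p95 or p99.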

// strategic implications

What this changes for the business

01 · Economics

Inference is the recurring bill

Training is capex; inference is opex that scales with success. Unit economics — cost per request, per task, per user — determine which AI products are businesses and which are subsidies. Model right-sizing and serving optimization routinely move these numbers severalfold, which makes them strategy, not plumbing.
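For self-hosted serving, the unit-cost arithmetic is short enough to sketch. The hourly GPU price and throughput figures below are hypothetical placeholders; substitute measured numbers for a real estimate.

```python
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Serving cost in USD per million generated tokens on one accelerator."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Same hypothetical $4/hour accelerator, before and after serving optimization.
baseline = cost_per_million_tokens(gpu_hourly_usd=4.0, tokens_per_second=1000)
optimized = cost_per_million_tokens(gpu_hourly_usd=4.0, tokens_per_second=4000)
print(f"baseline  ${baseline:.2f} per million tokens")
print(f"optimized ${optimized:.2f} per million tokens")
```

Because the hardware bill is fixed per hour, unit cost is inversely proportional to throughput: a fourfold throughput gain on the same hardware is a fourfold cut in cost per token.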

02 · Performance

Latency is a product feature

Users abandon slow AI. Decode speed, time-to-first-token, and streaming design shape perceived quality as much as answer accuracy. Latency budgets deserve explicit product requirements — and the serving stack, not just the model, determines whether they are met.

03 · Sovereignty

Where inference runs is a policy decision

API, cloud-hosted, on-premises, or on-device — inference placement determines data residency, privacy posture, and vendor dependence. Regulated industries increasingly treat inference location as a compliance requirement, not an infrastructure preference.

// common misconceptions

What Inference is not

Myth

“Training is the expensive part of AI.”

Reality

Training is a one-time cost; inference recurs with every use and scales with adoption. For successful products, cumulative inference spend can overtake the training bill quickly and keeps growing afterward, so the recurring bill is the one that needs engineering.

Myth

“Slow responses mean the model is too big.”

Reality

Serving architecture — batching, caching, quantization, hardware match — often dominates model size in latency outcomes. The same model can vary severalfold in speed across serving stacks. Diagnose the stack before downsizing the model.

Myth

“All tokens cost the same.”

Reality

Output tokens are typically priced at several times the rate of input tokens, because decode is serial while prefill parallelizes. Reasoning models add thinking tokens that are often hidden from the response but still metered. Understanding the meter is a prerequisite to managing the bill.
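A small calculator makes the meter explicit. The prices are hypothetical, and billing reasoning tokens at the output rate is an assumption made for this sketch; check the actual price sheet of the provider in question.

```python
def request_cost_usd(input_tokens, output_tokens, reasoning_tokens,
                     input_price_per_million, output_price_per_million):
    """Cost of one request given per-million-token prices."""
    # Assumption: hidden reasoning tokens are billed at the output rate.
    billed_output = output_tokens + reasoning_tokens
    total = (input_tokens * input_price_per_million
             + billed_output * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = request_cost_usd(input_tokens=2000, output_tokens=500,
                        reasoning_tokens=1500,
                        input_price_per_million=3.0,
                        output_price_per_million=15.0)
print(f"${cost:.4f} per request")
```

In this example the 2,000 visible input tokens account for $0.006 of the $0.036 total, while the 1,500 reasoning tokens the user never sees account for $0.0225.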

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.
