// term 13 · Production & Operations
Inference
Runtime AI Execution
Running a trained model on new inputs to produce outputs — the production phase of AI. Training happens once; inference happens billions of times. Latency, throughput, and unit cost are all inference properties, and nearly all enterprise AI spend accrues here.
// Spend share
80–90%
Of lifetime compute for a deployed model typically lands in inference, not training. The recurring bill dwarfs the one-time one.
// Phases
2
Prefill processes the prompt in parallel; decode generates output one token at a time. Each phase has different bottlenecks and different optimizations.
// Leverage
2–10x
Throughput gains routinely available from batching, caching, and quantization on identical hardware — serving engineering pays for itself.
// full definition
What Inference actually is
Training gets the headlines; inference pays the bills. Once a model is trained, every chatbot reply, document summary, and agent action is an inference operation — a forward pass through billions of parameters on accelerator hardware. For any successful deployment, cumulative inference compute rapidly dwarfs what training cost, which makes inference efficiency the dominant economic variable in production AI.
A request lives in two phases with opposite characters. Prefill ingests the entire prompt in parallel, which is compute-intensive but fast per token. Decode then generates output serially, one token at a time; each step requires a full pass over the model weights and is typically limited by memory bandwidth rather than raw compute, while attention state (the KV cache) occupies GPU memory. This is why, token for token, long prompts are cheap relative to long outputs, why latency tracks response length, and why streaming exists to mask the serial grind.
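The two-phase split can be made concrete with a back-of-envelope latency model. This is a minimal sketch: the function name and the throughput figures are assumptions chosen for illustration, not measurements of any real system.

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tokens_per_s: float,
                       decode_tokens_per_s: float) -> dict:
    """Rough request latency: parallel prefill followed by serial decode."""
    # Prefill processes the whole prompt before the first output token appears.
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    # Decode emits one token per step, so its time grows with output length.
    decode_time = output_tokens / decode_tokens_per_s
    return {
        "ttft_s": time_to_first_token,
        "decode_s": decode_time,
        "total_s": time_to_first_token + decode_time,
    }


# Hypothetical rates: prefill at 5,000 tokens/s, decode at 50 tokens/s.
long_prompt = estimate_latency_s(4000, 100, 5000.0, 50.0)
long_output = estimate_latency_s(100, 4000, 5000.0, 50.0)
```

Under these assumed rates, the request with the long output takes far longer than the one with the long prompt, even though both move the same total number of tokens.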
Between the model and the hardware sits the inference engine (vLLM, TensorRT-LLM, and peers), where most production efficiency is won. Continuous batching interleaves many users' requests to keep GPUs saturated; paged KV-cache management cuts memory fragmentation so more requests fit at once; quantized weights shrink the model's footprint; speculative decoding drafts tokens with a small model and verifies them with the large one. The same model on the same hardware can vary severalfold in throughput depending on the serving stack.
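Of these techniques, speculative decoding is the least intuitive, so a toy version may help. The sketch below assumes greedy decoding and uses two hypothetical stand-in functions (`toy_target`, `toy_draft`) in place of real models. It is not the API of vLLM or TensorRT-LLM, and it omits the bonus token that real implementations take when every drafted token is accepted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, verification rounds)."""
    tokens = list(prompt)
    produced = 0
    rounds = 0
    while produced < max_new_tokens:
        n = min(k, max_new_tokens - produced)
        # 1. The cheap draft model proposes up to n tokens.
        proposal: List[int] = []
        for _ in range(n):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal. A real engine scores all positions
        #    in one batched pass; this toy calls the target once per position.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens[len(prompt):], rounds


def toy_target(seq: List[int]) -> int:
    # Hypothetical "large model": counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Hypothetical "small model": agrees with the target except after 2, 5, 8.
    return 0 if seq[-1] % 3 == 2 else (seq[-1] + 1) % 10


new_tokens, rounds = speculative_decode(toy_target, toy_draft, [0], max_new_tokens=6)
```

Because every accepted token is checked against the target, the output matches what the target alone would have produced; the gain is that several tokens can be confirmed per expensive verification round.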
For decision-makers, inference is where AI strategy meets the income statement. Unit costs determine which use cases clear ROI hurdles; latency budgets determine which products feel responsive; capacity planning determines whether launches survive their own success. Vendor pricing, model right-sizing, and serving architecture deserve the same scrutiny as any other production infrastructure with a nine-figure trajectory.
// how it works
The life of a single request
Every model call traverses the same pipeline — and each stage carries a distinct lever for cost and latency.
Request Assembly
The application composes the prompt — system instructions, history, retrieved context — and dispatches it to the serving endpoint.
Tokenization
Text becomes token IDs. Input length is now fixed — and with it, the prefill compute bill for this request.
Prefill
The full prompt is processed in parallel, building the KV cache — the attention state the decode phase will read from.
Decode Loop
Output is generated one token per step, each requiring a full model pass. This serial loop is the main latency bottleneck of autoregressive text generation.
Streaming & Post-Processing
Tokens stream to the user as generated, masking decode latency; the application parses, validates, and formats the final output.
Metering & Monitoring
Tokens are counted and billed; latency, error rates, and quality signals flow to observability — the feedback loop of production operation.
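The six stages above can be traced in a toy pipeline. Everything here is a stand-in under stated assumptions: the whitespace tokenizer, the list used as a "KV cache", and the `toy_next_token` model (which simply echoes the prompt in uppercase) are hypothetical and exist only to show where each stage sits.

```python
from dataclasses import dataclass
from typing import Iterator, List

EOS = "<eos>"


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0


def tokenize(text: str) -> List[str]:
    # Stand-in tokenizer; real systems use subword vocabularies.
    return text.split()


def prefill(tokens: List[str]) -> List[str]:
    # Stand-in for building the KV cache: one entry per prompt token.
    return list(tokens)


def toy_next_token(cache: List[str], prompt_len: int, step: int) -> str:
    # Stand-in for one model pass: echo the prompt in uppercase, then stop.
    return cache[step].upper() if step < prompt_len else EOS


def serve(system: str, user: str, max_tokens: int, usage: Usage) -> Iterator[str]:
    prompt = f"{system}\n{user}"              # 1. request assembly
    tokens = tokenize(prompt)                 # 2. tokenization
    usage.input_tokens = len(tokens)          #    input length is now fixed
    cache = prefill(tokens)                   # 3. prefill builds the cache
    for step in range(max_tokens):            # 4. decode loop, one token per step
        tok = toy_next_token(cache, len(tokens), step)
        if tok == EOS:
            break
        cache.append(tok)                     #    the cache grows with each token
        usage.output_tokens += 1              # 6. metering counts what was produced
        yield tok                             # 5. streaming: emit as generated


usage = Usage()
streamed = list(serve("be brief", "hello world", max_tokens=10, usage=usage))
```

The generator makes the streaming point visible: the caller receives each token as soon as its decode step finishes, while `usage` accumulates the counts a billing system would read afterward.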
// anatomy
The components teams must understand
01
Inference Engine
The serving runtime
Software like vLLM or TensorRT-LLM that schedules requests and executes the model. The largest efficiency lever after model choice itself.
02
KV Cache
Attention state in memory
Per-request attention keys and values held in GPU memory throughout generation. Cache size grows with context length × concurrency, which often makes it the largest serving memory cost after the weights themselves.
03
Batching Scheduler
GPU saturation
Continuous batching interleaves many requests through the hardware simultaneously — the difference between idle silicon and full utilization.
04
Quantized Weights
Smaller, faster math
Reduced-precision parameters cut memory footprint and bandwidth, raising throughput with modest quality cost — standard practice in production serving.
05
Accelerator Fleet
The physical layer
GPUs or custom silicon executing the math. Procurement, utilization, and placement (cloud, on-prem, edge) set the cost floor.
06
Observability Hooks
Production telemetry
Latency percentiles, token throughput, error rates, and quality sampling — the instrumentation that turns serving from a black box into an operation.
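The memory arithmetic behind components 02 and 04 can be sketched in a few lines. The formulas are the standard ones for a transformer with a per-layer key and value cache; the model configuration (32 layers, 8 KV heads, head dimension 128, 8 billion parameters) is a hypothetical example, not a reference to any specific product.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int, context_len: int,
                   concurrent_requests: int) -> int:
    """KV cache size when every request fills its full context."""
    # Each token stores one key and one value vector per layer per KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * context_len * concurrent_requests


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Memory needed for the weights at a given precision."""
    return num_params * bits_per_param // 8


GIB = 2**30

# Hypothetical configuration, 16-bit cache values (2 bytes each).
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, bytes_per_value=2,
                       context_len=8192, concurrent_requests=16)
fp16_weights = weight_bytes(8_000_000_000, 16)
int4_weights = weight_bytes(8_000_000_000, 4)

print(f"KV cache: {cache / GIB:.1f} GiB")
print(f"Weights at 16-bit: {fp16_weights / 1e9:.0f} GB, at 4-bit: {int4_weights / 1e9:.0f} GB")
```

In this assumed setup the cache for sixteen full-context requests is as large as the 16-bit weights, which is why cache management and quantization sit side by side on the component list.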
// strategic implications
What this changes for the business
01 · Economics
Inference is the recurring bill
Training is capex; inference is opex that scales with success. Unit economics — cost per request, per task, per user — determine which AI products are businesses and which are subsidies. Model right-sizing and serving optimization routinely move these numbers severalfold, which makes them strategy, not plumbing.
02 · Performance
Latency is a product feature
Users abandon slow AI. Decode speed, time-to-first-token, and streaming design shape perceived quality as much as answer accuracy. Latency budgets deserve explicit product requirements — and the serving stack, not just the model, determines whether they are met.
03 · Sovereignty
Where inference runs is a policy decision
API, cloud-hosted, on-premises, or on-device — inference placement determines data residency, privacy posture, and vendor dependence. Regulated industries increasingly treat inference location as a compliance requirement, not an infrastructure preference.
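For self-hosted serving, the unit economics in the first implication reduce to simple arithmetic. This is a minimal sketch, assuming a single accelerator billed by the hour; the hourly price, throughput, and utilization figures are hypothetical placeholders, not quotes from any vendor.

```python
def cost_per_million_tokens(gpu_cost_per_hour: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Serving cost per one million generated tokens on one accelerator."""
    # Only the utilized fraction of each hour produces billable tokens.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical baseline: $2.00 per GPU-hour, 1,000 tokens/s, 50% utilization.
baseline = cost_per_million_tokens(2.00, 1000.0, 0.5)
# Same hardware after serving optimization doubles throughput.
optimized = cost_per_million_tokens(2.00, 2000.0, 0.5)
```

Throughput and utilization sit in the denominator, so any multiple gained in the serving stack divides the unit cost by the same multiple on unchanged hardware.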
// common misconceptions
What Inference is not
Myth
“Training is the expensive part of AI.”
Reality
Training is a one-time cost; inference recurs with every use and scales with adoption. For successful products, cumulative inference spend dwarfs training within months — the recurring bill is the one that needs engineering.
Myth
“Slow responses mean the model is too big.”
Reality
Serving architecture — batching, caching, quantization, hardware match — often dominates model size in latency outcomes. The same model can vary severalfold in speed across serving stacks. Diagnose the stack before downsizing the model.
Myth
“All tokens cost the same.”
Reality
Output tokens are typically priced at several times input tokens because decode is serial while prefill parallelizes, and reasoning models add thinking tokens that are usually billed but never shown to the user. Understanding the meter is prerequisite to managing the bill.
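A small calculator shows how the meter treats different token types. This is a sketch under stated assumptions: the prices are hypothetical, and the rule that reasoning tokens are billed at the output rate is an assumption that should be checked against the provider's actual pricing.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     reasoning_tokens: int,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    # Assumption: hidden reasoning tokens are billed at the output rate.
    billed_output = output_tokens + reasoning_tokens
    output_cost = billed_output * output_price_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
plain = request_cost_usd(2000, 500, 0, 1.0, 4.0)
with_reasoning = request_cost_usd(2000, 500, 1500, 1.0, 4.0)
```

With these assumed numbers, the visible answer is identical in both calls, yet the request that also produced 1,500 thinking tokens costs more than twice as much.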