# Inference — Runtime AI Execution

> Running a trained model on new inputs to produce outputs — the production phase of AI. Training happens once; inference happens billions of times. Latency, throughput, and unit cost are all inference properties, and nearly all enterprise AI spend accrues here.

**Canonical URL:** https://www.andekian.com/ai-lexicon/inference  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 13 of 100** · Production & Operations  
**Tags:** Serving, Latency, Throughput, Unit Cost

## Key Stats

- **Spend share — 80–90%:** The share of a deployed model's lifetime compute that typically lands in inference, not training. The recurring bill dwarfs the one-time one.
- **Phases — 2:** Prefill processes the prompt in parallel; decode generates output one token at a time. Each phase has different bottlenecks and different optimizations.
- **Leverage — 2–10x:** Throughput gains routinely available from batching, caching, and quantization on identical hardware — serving engineering pays for itself.

## What Inference Actually Is

Training gets the headlines; inference pays the bills. Once a model is trained, every chatbot reply, document summary, and agent action is an inference operation — a forward pass through billions of parameters on accelerator hardware. For any successful deployment, cumulative inference compute rapidly dwarfs what training cost, which makes inference efficiency the dominant economic variable in production AI.

A request lives in two phases with opposite characters. Prefill ingests the entire prompt in parallel — compute-intensive but fast per token. Decode then generates output serially, one token at a time, each step requiring a full pass over the model while attention state (the KV cache) occupies GPU memory. This is why long prompts are cheap relative to long outputs, why latency tracks response length, and why streaming exists to mask the serial grind.
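
A rough way to see this asymmetry is a two-term latency model: time to first token from prefill, plus output length times per-token decode time. The sketch below assumes constant processing rates. The `LatencyModel` name and the throughput figures are hypothetical illustrations, not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class LatencyModel:
    """Toy latency model; both rates are assumed constants for one request."""
    prefill_tokens_per_s: float  # prompt tokens processed per second (parallel phase)
    decode_tokens_per_s: float   # output tokens generated per second (serial phase)

    def time_to_first_token(self, prompt_tokens: int) -> float:
        # Prefill must finish before the first output token can appear.
        return prompt_tokens / self.prefill_tokens_per_s

    def total_latency(self, prompt_tokens: int, output_tokens: int) -> float:
        # Decode adds one serial step per output token.
        decode_time = output_tokens / self.decode_tokens_per_s
        return self.time_to_first_token(prompt_tokens) + decode_time


# Hypothetical rates: prefill is 100x faster per token than decode.
model = LatencyModel(prefill_tokens_per_s=5000.0, decode_tokens_per_s=50.0)

long_prompt = model.total_latency(prompt_tokens=4000, output_tokens=100)
long_output = model.total_latency(prompt_tokens=100, output_tokens=4000)
print(f"long prompt, short answer: {long_prompt:.2f}s")
print(f"short prompt, long answer: {long_output:.2f}s")
```

Under these assumed rates, swapping the same 4,000 tokens from the prompt side to the output side moves latency from a few seconds to over a minute, which is the pattern streaming is designed to hide.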

Between the model and the hardware sits the inference engine — vLLM, TensorRT-LLM, and peers — where most production efficiency is won. Continuous batching interleaves many users' requests to keep GPUs saturated; paged KV-cache management stretches memory; quantized weights shrink the model's footprint; speculative decoding drafts tokens with a small model and verifies with the large one. The same model on the same hardware can vary several-fold in throughput depending on the serving stack.
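
Of these techniques, speculative decoding is the least intuitive, so a toy version may help. The sketch below shows only the greedy variant: a draft model proposes `k` tokens, the target model checks them, and the longest agreeing prefix is kept along with one token from the target. The two "models" are hypothetical arithmetic stand-ins, and a real engine would verify all drafted positions in a single batched forward pass rather than one call per position.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token context to the next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    ctx = list(prompt)
    for _ in range(max_new):
        ctx.append(model(ctx))
    return ctx[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, verification rounds)."""
    out = list(prompt)
    verify_rounds = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens serially.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # 2. The target checks each proposed position (one batched pass in practice).
        verify_rounds += 1
        ctx = list(out)
        for token in proposal:
            expected = target(ctx)
            if expected != token:
                ctx.append(expected)  # first mismatch: keep the target's token, stop
                break
            ctx.append(token)         # match: the drafted token is accepted
        else:
            ctx.append(target(ctx))   # all k accepted: target adds one bonus token
        out = ctx
    return out[len(prompt):len(prompt) + max_new], verify_rounds


# Hypothetical stand-in models: the draft disagrees with the target now and then.
def target_model(ctx: List[int]) -> int:
    return (sum(ctx) + 1) % 10


def draft_model(ctx: List[int]) -> int:
    guess = target_model(ctx)
    return (guess + 1) % 10 if len(ctx) % 5 == 0 else guess


prompt = [1, 2, 3]
tokens, rounds = speculative_decode(target_model, draft_model, prompt, max_new=12)
reference = greedy_decode(target_model, prompt, max_new=12)
print(tokens == reference, rounds)
```

The output matches plain greedy decoding with the target model, but the target is consulted in far fewer rounds than there are output tokens. Sampling-based variants use an acceptance rule instead of exact matching, which this sketch does not cover.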

For decision-makers, inference is where AI strategy meets the income statement. Unit costs determine which use cases clear ROI hurdles; latency budgets determine which products feel responsive; capacity planning determines whether launches survive their own success. Vendor pricing, model right-sizing, and serving architecture deserve the same scrutiny as any other production infrastructure with a nine-figure trajectory.

## How It Works: The Life of a Single Request

Every model call traverses the same pipeline — and each stage carries a distinct lever for cost and latency.

1. **Request Assembly** — The application composes the prompt — system instructions, history, retrieved context — and dispatches it to the serving endpoint.
2. **Tokenization** — Text becomes token IDs. Input length is now fixed — and with it, the prefill compute bill for this request.
3. **Prefill** — The full prompt is processed in parallel, building the KV cache — the attention state the decode phase will read from.
4. **Decode Loop** — Output is generated one token per step, each requiring a full model pass. This serial loop is the main latency bottleneck of autoregressive generation.
5. **Streaming & Post-Processing** — Tokens stream to the user as generated, masking decode latency; the application parses, validates, and formats the final output.
6. **Metering & Monitoring** — Tokens are counted and billed; latency, error rates, and quality signals flow to observability — the feedback loop of production operation.
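
The six stages above can be sketched as a toy pipeline. Everything here is a stand-in: the whitespace tokenizer, the list used in place of a KV cache, and the placeholder tokens are hypothetical simplifications meant to show the order of operations, not how a real serving stack is implemented.

```python
from typing import Dict, Iterator, List, Tuple


def tokenize(text: str) -> List[str]:
    # Toy tokenizer: whitespace split. Real tokenizers use subword vocabularies.
    return text.split()


def prefill(tokens: List[str]) -> List[str]:
    # Stand-in for building the KV cache: one cache entry per prompt token.
    return list(tokens)


def decode(cache: List[str], max_new_tokens: int) -> Iterator[str]:
    # One token per step; each step reads the cache and appends to it.
    for _ in range(max_new_tokens):
        token = f"tok{len(cache)}"  # placeholder for the model's next-token choice
        cache.append(token)
        yield token                 # yielding per token is what enables streaming


def handle_request(system: str, history: List[str], user: str,
                   max_new_tokens: int) -> Tuple[str, Dict[str, int]]:
    prompt = "\n".join([system, *history, user])        # 1. request assembly
    input_tokens = tokenize(prompt)                      # 2. tokenization
    cache = prefill(input_tokens)                        # 3. prefill
    streamed = list(decode(cache, max_new_tokens))       # 4-5. decode loop, streamed
    usage = {"input_tokens": len(input_tokens),          # 6. metering
             "output_tokens": len(streamed)}
    return " ".join(streamed), usage


text, usage = handle_request("You are helpful.", ["Hi"], "Summarize this report", 3)
print(text, usage)
```

Note that the input token count is known before any model work starts, while the output count is only known once decoding stops.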

## Anatomy: The Components Teams Must Understand

- **Inference Engine** (The serving runtime): Software like vLLM or TensorRT-LLM that schedules requests and executes the model. The largest efficiency lever after model choice itself.
- **KV Cache** (Attention state in memory): Per-request attention keys and values held in GPU memory throughout generation. Cache memory grows roughly in proportion to context length × concurrency, which sets much of the serving memory bill.
- **Batching Scheduler** (GPU saturation): Continuous batching interleaves many requests through the hardware simultaneously — the difference between idle silicon and full utilization.
- **Quantized Weights** (Smaller, faster math): Reduced-precision parameters cut memory footprint and bandwidth, raising throughput with modest quality cost — standard practice in production serving.
- **Accelerator Fleet** (The physical layer): GPUs or custom silicon executing the math. Procurement, utilization, and placement (cloud, on-prem, edge) set the cost floor.
- **Observability Hooks** (Production telemetry): Latency percentiles, token throughput, error rates, and quality sampling — the instrumentation that turns serving from a black box into an operation.
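
To make the KV cache and quantization entries concrete, here is a back-of-envelope sizing sketch. It assumes a standard transformer attention layout (keys and values stored per layer, per KV head) and a worst case in which every request fills its full context. The model shape and parameter count are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(context_tokens: int, concurrent_requests: int,
                   layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    # Worst case: every concurrent request holds a full context in the cache.
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return per_token * context_tokens * concurrent_requests


def weight_bytes(parameters: int, bits_per_weight: int) -> int:
    # Memory for the weights alone at a given precision.
    return parameters * bits_per_weight // 8


# Hypothetical model shape, 16-bit cache values.
shape = dict(layers=32, kv_heads=8, head_dim=128)
per_token = kv_cache_bytes_per_token(**shape)
total = kv_cache_bytes(context_tokens=8192, concurrent_requests=16, **shape)

print(f"KV cache per token: {per_token / 2**10:.0f} KiB")
print(f"KV cache, 16 requests x 8192 tokens: {total / 2**30:.0f} GiB")
print(f"7B weights at 16-bit: {weight_bytes(7_000_000_000, 16) / 1e9:.1f} GB")
print(f"7B weights at 4-bit:  {weight_bytes(7_000_000_000, 4) / 1e9:.1f} GB")
```

In this hypothetical configuration the cache for 16 full-length requests is larger than the 16-bit weights themselves, which is why paged cache management and quantization are usually considered together.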

## Strategic Implications

- **Inference is the recurring bill** (01 · Economics): Training is capex; inference is opex that scales with success. Unit economics — cost per request, per task, per user — determine which AI products are businesses and which are subsidies. Model right-sizing and serving optimization routinely move these numbers severalfold, which makes them strategy, not plumbing.
- **Latency is a product feature** (02 · Performance): Users abandon slow AI. Decode speed, time-to-first-token, and streaming design shape perceived quality as much as answer accuracy. Latency budgets deserve explicit product requirements — and the serving stack, not just the model, determines whether they are met.
- **Where inference runs is a policy decision** (03 · Sovereignty): API, cloud-hosted, on-premises, or on-device — inference placement determines data residency, privacy posture, and vendor dependence. Regulated industries increasingly treat inference location as a compliance requirement, not an infrastructure preference.
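
For self-hosted serving, the economics point reduces to simple arithmetic: hourly hardware cost divided by tokens actually served. The sketch below is a minimal illustration under assumed numbers. The hourly rate, throughput, and utilization are hypothetical, and the model ignores engineering, networking, and idle-capacity costs.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float,
                            utilization: float) -> float:
    """Hardware cost per million tokens served, given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical figures: same GPU and price, different serving stack throughput.
baseline = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000.0,
                                   utilization=0.5)
optimized = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=3000.0,
                                    utilization=0.5)

print(f"baseline:  ${baseline:.2f} per million tokens")
print(f"optimized: ${optimized:.2f} per million tokens")
```

Because throughput sits in the denominator, a threefold serving improvement cuts unit cost to a third without touching the hardware bill.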

## Common Misconceptions

- **Myth:** “Training is the expensive part of AI.”  
  **Reality:** Training is a one-time cost; inference recurs with every use and scales with adoption. For successful products, cumulative inference spend dwarfs training within months — the recurring bill is the one that needs engineering.
- **Myth:** “Slow responses mean the model is too big.”  
  **Reality:** Serving architecture (batching, caching, quantization, hardware match) often matters more than model size for latency. The same model can vary severalfold in speed across serving stacks. Diagnose the stack before downsizing the model.
- **Myth:** “All tokens cost the same.”  
  **Reality:** Output tokens are typically priced at several times the rate of input tokens, because decode is serial while prefill parallelizes. Reasoning models add thinking tokens that are billed but never shown to the user. Understanding the meter is a prerequisite to managing the bill.
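
A small sketch of how such a meter might be read. The prices are hypothetical, and the code assumes that reasoning tokens are billed at the output rate. Check the vendor's pricing page before relying on either assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-token prices, in USD per million tokens."""
    input_usd_per_million: float
    output_usd_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, visible_output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    # Assumption: hidden reasoning tokens are metered at the output rate.
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_usd_per_million
             + billed_output * pricing.output_usd_per_million)
    return total / 1_000_000


pricing = Pricing(input_usd_per_million=1.0, output_usd_per_million=4.0)

plain = request_cost(pricing, input_tokens=2000, visible_output_tokens=500)
with_reasoning = request_cost(pricing, input_tokens=2000, visible_output_tokens=500,
                              reasoning_tokens=1500)

print(f"without reasoning tokens: ${plain:.4f}")
print(f"with reasoning tokens:    ${with_reasoning:.4f}")
```

The visible answer is identical in both cases, yet under these assumed prices the second request costs two and a half times as much.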

## Related Terms

- [LLM — Large Language Model](https://www.andekian.com/ai-lexicon/llm)
- [Token — Unit Of AI Processing](https://www.andekian.com/ai-lexicon/token)
- [Context Window — Operational Memory Limit](https://www.andekian.com/ai-lexicon/context-window)
- [SLMs & Distillation — Compression · Speed · Deployment](https://www.andekian.com/ai-lexicon/slms-and-distillation)
- [Quantization — Reduced Precision Models](https://www.andekian.com/ai-lexicon/quantization)
- [Mixture of Experts — Specialized Sub-Model Routing](https://www.andekian.com/ai-lexicon/mixture-of-experts)
- [Observability — Production AI Monitoring](https://www.andekian.com/ai-lexicon/observability)
- [AI Inference Engine — Model Execution Infrastructure](https://www.andekian.com/ai-lexicon/ai-inference-engine)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon

## Contact

Book a conversation or send an inquiry: https://www.andekian.com/#contact
LinkedIn: https://www.linkedin.com/in/andekian/