# Quantization — Reduced Precision Models

> Shrinking a model by storing its weights in fewer bits — from 16-bit floats down to 8-, 4-, even 2-bit integers. Quantization cuts memory footprint and inference cost severalfold with minimal quality loss, and is the enabling technology behind affordable serving and on-device AI.

**Canonical URL:** https://www.andekian.com/ai-lexicon/quantization  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 38 of 100** · Model Efficiency  
**Tags:** INT8/INT4, Compression, Edge Deployment, Serving Cost

## Key Stats

- **Compression — 4–8x:** Memory reduction from bf16 to INT4 — a 70B model shrinks from ~140GB toward ~35GB, crossing single-GPU and laptop thresholds.
- **Quality cost — 1–3%:** Typical benchmark degradation for well-executed 4-bit quantization — often imperceptible on deployed tasks.
- **Throughput — 2–4x:** Inference speedup from reduced memory bandwidth — the same hardware serves multiples more traffic.

## What Quantization Actually Is

A model's weights are billions of numbers, and the precision of each is a budget decision. Full training precision spends 16 or 32 bits per weight; quantization re-stores them on a coarser grid — 8, 4, or fewer bits — accepting tiny rounding errors in exchange for a dramatically smaller artifact. The empirical surprise that built an industry: neural networks tolerate this remarkably well. Capability lives in the pattern across billions of weights, not in the final decimal places of any one of them.

The mechanics matter to quality. Naive uniform rounding degrades models; production methods are smarter — calibrating quantization ranges on sample data, keeping outlier-sensitive weights at higher precision, quantizing per-channel rather than per-tensor. Activations and the KV cache can be quantized alongside weights for compounding gains. Quantization-aware training goes further, teaching the model during training to be robust to its eventual low-precision life.

The economics are straightforwardly large. Memory is the binding constraint of LLM serving — a 4x smaller model fits more concurrent users on the same GPU, or fits on hardware it previously couldn't touch at all. Bandwidth savings translate directly into tokens-per-second. The thresholds quantization crosses are strategic: models that required a GPU cluster fit a single card; models that required a server fit a laptop or phone — which is the entire technical foundation of on-device and edge AI.

The discipline is evaluation, because quality loss is real but uneven. Aggregate benchmarks may dip a point while specific capabilities — long-chain math, low-resource languages, edge-case reasoning — degrade more. The production rule: evaluate the quantized model on your workload, not on the leaderboard, and choose the precision point empirically. The menu (8-bit safe, 4-bit standard, 2-3-bit aggressive) is a cost-quality frontier, and where you sit on it is a measured decision.

## How It Works: Fewer bits, nearly the same model

Quantization maps continuous weights onto a coarse grid — the craft is in choosing what precision to spend where.

1. **Precision Selection** — The target format is chosen — INT8 for safety, INT4 as the modern standard, lower bits where economics demand and quality permits.
2. **Calibration** — Sample data flows through the model to map actual value ranges — the statistics that make rounding intelligent rather than naive.
3. **Weight Mapping** — Continuous weights snap to the discrete grid — per-channel scaling and outlier handling preserving what matters most.
4. **Mixed-Precision Rescue** — Sensitive layers and outlier weights keep higher precision — spending bits where the model actually needs them.
5. **Task Evaluation** — The quantized model is benchmarked on your workloads against the original — uneven degradation caught before deployment, not after.
6. **Optimized Serving** — The compressed model deploys on kernels built for low-precision math — where the memory savings become throughput and cost wins.

## Anatomy: The Components Teams Must Understand

- **Bit Width** (The precision dial): Bits per weight — 16 down to 2. Each halving doubles capacity per GPU and raises the quality-engineering stakes.
- **Calibration Data** (Rounding, informed): Representative samples that reveal value distributions — the difference between statistical mapping and blind truncation.
- **Outlier Handling** (The quality guard): Rare large-magnitude weights that naive grids destroy — preserved through mixed precision and per-channel schemes.
- **GPTQ / AWQ Methods** (Post-training standards): The algorithm families that quantize finished models in hours without retraining — the workhorse of open-model deployment.
- **Quantization-Aware Training** (Robustness built in): Training with simulated low precision so the model learns to live there — highest quality at the highest pipeline cost.
- **KV-Cache Quantization** (The second target): Compressing attention state alongside weights — where long-context serving memory actually goes, and saves.

## Strategic Implications

- **The cheapest capacity multiplier in serving** (01 · Economics): Quantization multiplies users-per-GPU and tokens-per-second on hardware you already run — typically the first and highest-ROI optimization in any self-hosted deployment. If you serve models and haven't quantized, you are paying for precision your tasks don't use.
- **The technology behind on-device AI** (02 · Reach): Every threshold quantization crosses — single GPU, laptop, phone — unlocks deployment classes with different privacy, latency, and cost profiles. Edge AI strategies stand or fall on quantization quality; it is the enabler, not an optimization detail.
- **Evaluate the model you deploy, not its parent** (03 · Diligence): Quantized variants are different models with unevenly distributed quality loss — aggregate benchmarks hide capability-specific regressions. Run your evaluation suite on the exact artifact you ship, and know which precision tier your vendors are actually serving you.

## Common Misconceptions

- **Myth:** “Quantization wrecks model quality.”  
  **Reality:** Modern methods hold 4-bit degradation to a few points or less — often imperceptible in deployment. The wreckage stories come from naive rounding; calibrated, outlier-aware quantization is a different technology.
- **Myth:** “Lower bits is always the right trade.”  
  **Reality:** Degradation is uneven across capabilities and accelerates below 4 bits. The right precision is workload-specific and empirical — a measured point on the cost-quality frontier, not a race to the bottom.
- **Myth:** “Quantization is a niche deployment trick.”  
  **Reality:** It is standard practice across the industry — most served models, open and closed, run quantized in production. The question isn't whether to quantize but which precision your quality bar permits.

## Related Terms

- [Token — Unit Of AI Processing](https://www.andekian.com/ai-lexicon/token)
- [SLMs & Distillation — Compression · Speed · Deployment](https://www.andekian.com/ai-lexicon/slms-and-distillation)
- [Weights & Parameters — Learned Intelligence As Math](https://www.andekian.com/ai-lexicon/weights-and-parameters)
- [Inference — Runtime AI Execution](https://www.andekian.com/ai-lexicon/inference)
- [Open Weights — Public Model Parameters](https://www.andekian.com/ai-lexicon/open-weights)
- [Model Pruning — Removes Unnecessary Weights](https://www.andekian.com/ai-lexicon/model-pruning)
- [Sparse Models — Partial Network Activation](https://www.andekian.com/ai-lexicon/sparse-models)
- [AI Inference Engine — Model Execution Infrastructure](https://www.andekian.com/ai-lexicon/ai-inference-engine)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon

## Contact

Book a conversation or send an inquiry: https://www.andekian.com/#contact
LinkedIn: https://www.linkedin.com/in/andekian/