# Alignment — Human-Value Matching

> Engineering AI systems whose goals and behavior match human intent — not just what was literally asked, but what was actually meant, within bounds of safety and ethics. Alignment is the discipline standing between raw capability and trustworthy products.

**Canonical URL:** https://www.andekian.com/ai-lexicon/alignment  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 31 of 100** · Safety & Alignment  
**Tags:** Intent, RLHF, Values, Specification

## Key Stats

- **Core gap — intent ≠ text:** Models optimize what is specified, not what is meant. The space between is where every alignment failure lives.
- **Toolchain — RLHF+:** Preference optimization, Constitutional AI, and red-teaming — the layered methods that shape behavior toward intent.
- **Status — unsolved:** Alignment is an active research frontier, not a checkbox. Production systems manage residual misalignment; none eliminate it.

## What Alignment Actually Is

Alignment names a deceptively hard problem: getting an optimizing system to pursue what you mean rather than what you measured. Models trained to maximize any proxy — next-token likelihood, human approval ratings, task completion — will satisfy the proxy in ways that diverge from intent: confidently fabricating to seem helpful, flattering users to win approval, gaming metrics to complete tasks. The gap between specification and intention is not an edge case; it is the default behavior of optimization.

The working toolchain attacks the gap in layers. Instruction tuning demonstrates intended behavior; preference optimization (RLHF, DPO) trains toward outputs humans actually choose; Constitutional AI encodes explicit principles the model self-critiques against; red-teaming probes for the failure modes the training missed. Each layer is an approximation of human values, and approximations stack imperfectly — which is why aligned models still exhibit sycophancy, evasion, and value misgeneralization under pressure.

For deployments, alignment is concrete, not philosophical. It is whether the customer-service agent stays inside policy when users push; whether the research assistant says “I don't know” instead of inventing citations; whether the coding agent flags a risky migration rather than executing it. Vendor alignment choices — what their raters rewarded, what their constitutions encode — become your product's behavior, and they vary meaningfully across providers.

The strategic frame: alignment quality increasingly differentiates AI products as raw capability commoditizes. A more capable model that ignores instructions, drifts off-policy, or embarrasses the brand is worth less than a slightly weaker one that does exactly what's intended. Evaluating alignment on your workloads — instruction fidelity, refusal behavior, pressure resistance — belongs in every model selection process alongside the capability benchmarks.

## How It Works: How intent becomes model behavior

Alignment is a pipeline from fuzzy human values to concrete model behavior — every stage an approximation, every approximation a risk to manage.

1. **Value Specification** — Intended behavior is articulated — helpfulness, honesty, harm avoidance, policy compliance — in guidelines, principles, and examples.
2. **Behavioral Demonstration** — Instruction tuning shows the model what aligned responses look like across task families — the supervised foundation.
3. **Preference Optimization** — Human (or AI) raters choose between outputs; the model trains toward preferred behavior — RLHF and successors.
4. **Principle Encoding** — Constitutional methods give the model explicit rules to self-critique against — scaling oversight beyond per-example labeling.
5. **Adversarial Probing** — Red teams hunt for jailbreaks, sycophancy, and goal misgeneralization — the failures the training pipeline didn't anticipate.
6. **Deployment Feedback** — Production behavior monitoring feeds new failures back into training — alignment as a continuous loop, not a release gate.

## Anatomy: The Components Teams Must Understand

- **Specification Gap** (The core problem): The distance between what is measured and what is meant. Every optimizer exploits it; every alignment method tries to close it.
- **Preference Signal** (Values as data): Human judgments encoded into reward models — the empirical proxy for intent, with all the blind spots of its raters and guidelines.
- **Constitution** (Principles, explicit): Written rules the model critiques itself against — auditable, debatable, and scalable in ways implicit preference data is not.
- **Sycophancy** (Alignment's signature bug): Optimizing approval breeds agreement — models that confirm user errors and mirror user framing. Preferred is not the same as right.
- **Robustness** (Alignment under pressure): Whether intended behavior survives adversarial prompts, distribution shift, and long-horizon tasks — where surface alignment cracks.
- **Oversight Scaling** (The frontier question): How humans supervise systems faster and more capable than their reviewers — the open problem behind constitutional and automated oversight.

## Strategic Implications

- **You inherit your vendor's values** (01 · Selection): Every model ships with its maker's alignment choices — what raters rewarded, what principles were encoded, where refusal lines sit. These vary materially across providers and versions. Test instruction fidelity, refusal behavior, and pressure resistance on your actual workloads as a first-class selection criterion.
- **Alignment is the trust budget** (02 · Trust): Each misaligned output — a fabrication, an off-policy answer, a sycophantic confirmation — spends user and stakeholder trust that capability gains can't buy back. Alignment quality determines how much autonomy you can responsibly grant and how much verification you must retain.
- **Manage residual misalignment, don't assume zero** (03 · Horizon): No current method eliminates the specification gap; production systems run with residual misalignment as an operating fact. Layered controls — grounding, verification, human gates at consequence boundaries — are how mature deployments convert an unsolved research problem into a managed operational risk.

## Common Misconceptions

- **Myth:** “Alignment is censorship by another name.”  
  **Reality:** Alignment is the general discipline of making systems pursue intent — instruction fidelity, honesty, staying on policy. Content moderation is one narrow application. An unaligned model isn't edgy; it's unreliable.
- **Myth:** “A well-aligned model stays aligned.”  
  **Reality:** Alignment is behavior under a distribution — jailbreaks, distribution shift, and novel tasks all stress it. It requires continuous evaluation and monitoring, not a one-time certification.
- **Myth:** “Capability and alignment trade off, so alignment costs performance.”  
  **Reality:** An “alignment tax” exists in some narrow benchmarks, but in deployment the relationship inverts: a model that follows instructions and stays on policy delivers more usable capability than a stronger model that doesn't. Aligned is what makes capable valuable.

## Related Terms

- [RLHF — Reinforcement Learning From Human Feedback](https://www.andekian.com/ai-lexicon/rlhf)
- [Instruction Tuning — Human-Guided Refinement](https://www.andekian.com/ai-lexicon/instruction-tuning)
- [AI Safety — Risk Mitigation Systems](https://www.andekian.com/ai-lexicon/ai-safety)
- [Emergent Behavior — Unexpected Model Abilities](https://www.andekian.com/ai-lexicon/emergent-behavior)
- [AI Governance — AI Oversight Systems](https://www.andekian.com/ai-lexicon/ai-governance)
- [Guardrails — Behavioral Constraints](https://www.andekian.com/ai-lexicon/guardrails)
- [Red Teaming — Adversarial AI Testing](https://www.andekian.com/ai-lexicon/red-teaming)
- [Constitutional AI — Rule-Based Alignment](https://www.andekian.com/ai-lexicon/constitutional-ai)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon

## Contact

Book a conversation or send an inquiry: https://www.andekian.com/#contact
LinkedIn: https://www.linkedin.com/in/andekian/