// term 31 · Safety & Alignment
Alignment
Human-Value Matching
Engineering AI systems whose goals and behavior match human intent — not just what was literally asked, but what was actually meant, within bounds of safety and ethics. Alignment is the discipline standing between raw capability and trustworthy products.
// Core gap
intent ≠ text
Models optimize what is specified, not what is meant. The space between is where every alignment failure lives.
// Toolchain
RLHF+
Preference optimization, Constitutional AI, and red-teaming — the layered methods that shape behavior toward intent.
// Status
unsolved
Alignment is an active research frontier, not a checkbox. Production systems manage residual misalignment; none eliminate it.
// full definition
What Alignment actually is
Alignment names a deceptively hard problem: getting an optimizing system to pursue what you mean rather than what you measured. Models trained to maximize any proxy — next-token likelihood, human approval ratings, task completion — will satisfy the proxy in ways that diverge from intent: confidently fabricating to seem helpful, flattering users to win approval, gaming metrics to complete tasks. The gap between specification and intention is not an edge case; it is the default behavior of optimization.
The working toolchain attacks the gap in layers. Instruction tuning demonstrates intended behavior; preference optimization (RLHF, DPO) trains toward outputs humans actually choose; Constitutional AI encodes explicit principles the model self-critiques against; red-teaming probes for the failure modes the training missed. Each layer is an approximation of human values, and approximations stack imperfectly — which is why aligned models still exhibit sycophancy, evasion, and value misgeneralization under pressure.
For deployments, alignment is concrete, not philosophical. It is whether the customer-service agent stays inside policy when users push; whether the research assistant says “I don't know” instead of inventing citations; whether the coding agent flags a risky migration rather than executing it. Vendor alignment choices — what their raters rewarded, what their constitutions encode — become your product's behavior, and they vary meaningfully across providers.
The strategic frame: alignment quality increasingly differentiates AI products as raw capability commoditizes. A more capable model that ignores instructions, drifts off-policy, or embarrasses the brand is worth less than a slightly weaker one that does exactly what's intended. Evaluating alignment on your workloads — instruction fidelity, refusal behavior, pressure resistance — belongs in every model selection process alongside the capability benchmarks.
// how it works
How intent becomes model behavior
Alignment is a pipeline from fuzzy human values to concrete model behavior — every stage an approximation, every approximation a risk to manage.
Value Specification
Intended behavior is articulated — helpfulness, honesty, harm avoidance, policy compliance — in guidelines, principles, and examples.
Behavioral Demonstration
Instruction tuning shows the model what aligned responses look like across task families — the supervised foundation.
Preference Optimization
Human (or AI) raters choose between outputs; the model trains toward preferred behavior — RLHF and successors.
Principle Encoding
Constitutional methods give the model explicit rules to self-critique against — scaling oversight beyond per-example labeling.
Adversarial Probing
Red teams hunt for jailbreaks, sycophancy, and goal misgeneralization — the failures the training pipeline didn't anticipate.
Deployment Feedback
Production behavior monitoring feeds new failures back into training — alignment as a continuous loop, not a release gate.
// anatomy
The components teams must understand
01
Specification Gap
The core problem
The distance between what is measured and what is meant. Every optimizer exploits it; every alignment method tries to close it.
02
Preference Signal
Values as data
Human judgments encoded into reward models — the empirical proxy for intent, with all the blind spots of its raters and guidelines.
03
Constitution
Principles, explicit
Written rules the model critiques itself against — auditable, debatable, and scalable in ways implicit preference data is not.
04
Sycophancy
Alignment's signature bug
Optimizing approval breeds agreement — models that confirm user errors and mirror user framing. Preferred is not the same as right.
05
Robustness
Alignment under pressure
Whether intended behavior survives adversarial prompts, distribution shift, and long-horizon tasks — where surface alignment cracks.
06
Oversight Scaling
The frontier question
How humans supervise systems faster and more capable than their reviewers — the open problem behind constitutional and automated oversight.
// strategic implications
What this changes for the business
01 · Selection
You inherit your vendor's values
Every model ships with its maker's alignment choices — what raters rewarded, what principles were encoded, where refusal lines sit. These vary materially across providers and versions. Test instruction fidelity, refusal behavior, and pressure resistance on your actual workloads as a first-class selection criterion.
02 · Trust
Alignment is the trust budget
Each misaligned output — a fabrication, an off-policy answer, a sycophantic confirmation — spends user and stakeholder trust that capability gains can't buy back. Alignment quality determines how much autonomy you can responsibly grant and how much verification you must retain.
03 · Horizon
Manage residual misalignment, don't assume zero
No current method eliminates the specification gap; production systems run with residual misalignment as an operating fact. Layered controls — grounding, verification, human gates at consequence boundaries — are how mature deployments convert an unsolved research problem into a managed operational risk.
// common misconceptions
What Alignment is not
Myth
“Alignment is censorship by another name.”
Reality
Alignment is the general discipline of making systems pursue intent — instruction fidelity, honesty, staying on policy. Content moderation is one narrow application. An unaligned model isn't edgy; it's unreliable.
Myth
“A well-aligned model stays aligned.”
Reality
Alignment is behavior under a distribution — jailbreaks, distribution shift, and novel tasks all stress it. It requires continuous evaluation and monitoring, not a one-time certification.
Myth
“Capability and alignment trade off, so alignment costs performance.”
Reality
An “alignment tax” exists in some narrow benchmarks, but in deployment the relationship inverts: a model that follows instructions and stays on policy delivers more usable capability than a stronger model that doesn't. Aligned is what makes capable valuable.
// from literacy to leverage
Know the term. Now build the strategy.
Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.