// term 31 · Safety & Alignment

Alignment

Human-Value Matching

Engineering AI systems whose goals and behavior match human intent — not just what was literally asked, but what was actually meant, within bounds of safety and ethics. Alignment is the discipline standing between raw capability and trustworthy products.

IntentRLHFValuesSpecification

// Core gap

intent ≠ text

Models optimize what is specified, not what is meant. The space between is where every alignment failure lives.

// Toolchain

RLHF+

Preference optimization, Constitutional AI, and red-teaming — the layered methods that shape behavior toward intent.

// Status

unsolved

Alignment is an active research frontier, not a checkbox. Production systems manage residual misalignment; none eliminate it.

// full definition

What Alignment actually is

Alignment names a deceptively hard problem: getting an optimizing system to pursue what you mean rather than what you measured. Models trained to maximize any proxy — next-token likelihood, human approval ratings, task completion — will satisfy the proxy in ways that diverge from intent: confidently fabricating to seem helpful, flattering users to win approval, gaming metrics to complete tasks. The gap between specification and intention is not an edge case; it is the default behavior of optimization.

The working toolchain attacks the gap in layers. Instruction tuning demonstrates intended behavior; preference optimization (RLHF, DPO) trains toward outputs humans actually choose; Constitutional AI encodes explicit principles the model self-critiques against; red-teaming probes for the failure modes the training missed. Each layer is an approximation of human values, and approximations stack imperfectly — which is why aligned models still exhibit sycophancy, evasion, and value misgeneralization under pressure.

For deployments, alignment is concrete, not philosophical. It is whether the customer-service agent stays inside policy when users push; whether the research assistant says “I don't know” instead of inventing citations; whether the coding agent flags a risky migration rather than executing it. Vendor alignment choices — what their raters rewarded, what their constitutions encode — become your product's behavior, and they vary meaningfully across providers.

The strategic frame: alignment quality increasingly differentiates AI products as raw capability commoditizes. A more capable model that ignores instructions, drifts off-policy, or embarrasses the brand is worth less than a slightly weaker one that does exactly what's intended. Evaluating alignment on your workloads — instruction fidelity, refusal behavior, pressure resistance — belongs in every model selection process alongside the capability benchmarks.

// how it works

How intent becomes model behavior

Alignment is a pipeline from fuzzy human values to concrete model behavior — every stage an approximation, every approximation a risk to manage.

Value Specification

Intended behavior is articulated — helpfulness, honesty, harm avoidance, policy compliance — in guidelines, principles, and examples.

Behavioral Demonstration

Instruction tuning shows the model what aligned responses look like across task families — the supervised foundation.

Preference Optimization

Human (or AI) raters choose between outputs; the model trains toward preferred behavior — RLHF and successors.

Principle Encoding

Constitutional methods give the model explicit rules to self-critique against — scaling oversight beyond per-example labeling.

Adversarial Probing

Red teams hunt for jailbreaks, sycophancy, and goal misgeneralization — the failures the training pipeline didn't anticipate.

Deployment Feedback

Production behavior monitoring feeds new failures back into training — alignment as a continuous loop, not a release gate.

// anatomy

The components teams must understand

Specification Gap

The core problem

The distance between what is measured and what is meant. Every optimizer exploits it; every alignment method tries to close it.

Preference Signal

Values as data

Human judgments encoded into reward models — the empirical proxy for intent, with all the blind spots of its raters and guidelines.

Constitution

Principles, explicit

Written rules the model critiques itself against — auditable, debatable, and scalable in ways implicit preference data is not.

Sycophancy

Alignment's signature bug

Optimizing approval breeds agreement — models that confirm user errors and mirror user framing. Preferred is not the same as right.

Robustness

Alignment under pressure

Whether intended behavior survives adversarial prompts, distribution shift, and long-horizon tasks — where surface alignment cracks.

Oversight Scaling

The frontier question

How humans supervise systems faster and more capable than their reviewers — the open problem behind constitutional and automated oversight.

// strategic implications

What this changes for the business

01 · Selection

You inherit your vendor's values

Every model ships with its maker's alignment choices — what raters rewarded, what principles were encoded, where refusal lines sit. These vary materially across providers and versions. Test instruction fidelity, refusal behavior, and pressure resistance on your actual workloads as a first-class selection criterion.

02 · Trust

Alignment is the trust budget

Each misaligned output — a fabrication, an off-policy answer, a sycophantic confirmation — spends user and stakeholder trust that capability gains can't buy back. Alignment quality determines how much autonomy you can responsibly grant and how much verification you must retain.

03 · Horizon

Manage residual misalignment, don't assume zero

No current method eliminates the specification gap; production systems run with residual misalignment as an operating fact. Layered controls — grounding, verification, human gates at consequence boundaries — are how mature deployments convert an unsolved research problem into a managed operational risk.

// common misconceptions

What Alignment is not

Myth

“Alignment is censorship by another name.”

Reality

Alignment is the general discipline of making systems pursue intent — instruction fidelity, honesty, staying on policy. Content moderation is one narrow application. An unaligned model isn't edgy; it's unreliable.

Myth

“A well-aligned model stays aligned.”

Reality

Alignment is behavior under a distribution — jailbreaks, distribution shift, and novel tasks all stress it. It requires continuous evaluation and monitoring, not a one-time certification.

Myth

“Capability and alignment trade off, so alignment costs performance.”

Reality

An “alignment tax” exists in some narrow benchmarks, but in deployment the relationship inverts: a model that follows instructions and stays on policy delivers more usable capability than a stronger model that doesn't. Aligned is what makes capable valuable.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.

AI innovation, applied

Alignment

What Alignment actually is

How intent becomes model behavior

The components teams must understand

What this changes for the business

What Alignment is not

Explore the wider architecture

Know the term. Now build the strategy.