// term 06 · Safety & Alignment

RLHF

Reinforcement Learning from Human Feedback

The post-training technique that turned raw text predictors into usable assistants: human raters rank model outputs, a reward model learns those preferences, and reinforcement learning optimizes the LLM against that learned reward — aligning behavior with helpfulness, safety, and intent.

Alignment · Reward Model · Preferences · Post-Training

// Preference data

100K–1M+

Human comparison judgments behind a frontier-grade alignment run — among the most expensive datasets in the entire pipeline.

// Pipeline

3 stages

Supervised fine-tuning, reward model training, then RL optimization — each with distinct data requirements and failure modes.

// Impact

ChatGPT

The technique that made conversational AI viable. The gap between GPT-3 and ChatGPT was largely post-training alignment, not scale.

// full definition

What RLHF actually is

A freshly pre-trained LLM is a text predictor, not an assistant — ask it a question and it may continue with more questions, because that is what its training distribution suggests. RLHF closes the gap between predicting text and being useful. It begins with supervised fine-tuning on human-written demonstrations, teaching the model the assistant format: answer the question, follow the instruction, refuse appropriately.

The core innovation is the reward model. Rather than writing ideal answers, which is slow and expensive, human raters compare pairs of model outputs and pick the better one. A separate network trains on these comparisons until it can predict human preference for any output, converting subjective judgment into a scalar score. Reinforcement learning (classically PPO) then optimizes the assistant to maximize that score, with a KL penalty tethering it to the supervised fine-tuned reference model so capability and fluency survive the process.
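A minimal sketch of how pairwise comparisons become a training signal, assuming the Bradley-Terry formulation commonly used for reward models. The function names and the example scores are illustrative, not taken from any particular library.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the rater-preferred response wins."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the rater's choice; lower is better."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Hypothetical reward-model scores for two responses to the same prompt.
agree = pairwise_loss(reward_chosen=2.0, reward_rejected=0.5)     # model ranks like the rater
disagree = pairwise_loss(reward_chosen=0.5, reward_rejected=2.0)  # model ranks the other way
print(f"loss when agreeing: {agree:.3f}, when disagreeing: {disagree:.3f}")
```

Only the difference between the two scores matters, which is why raters can compare outputs instead of assigning absolute grades.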

RLHF is also where many failure modes of modern assistants originate. The model optimizes a proxy, the reward model, and proxies get gamed: verbose padding, confident hedging, and flattering agreement (sycophancy) all score well with raters while serving users poorly. Successor techniques like DPO simplify the pipeline, and Constitutional AI replaces some human labeling with principle-guided self-critique, but the proxy-optimization tension remains.
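To make the DPO simplification concrete, here is a rough sketch of the per-pair DPO loss. It works directly on sequence log-probabilities from the policy and a frozen reference model, so no separate reward model is trained. The function name, the example log-probabilities, and the default beta of 0.1 are assumptions for illustration.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # log(1 + exp(-margin)) is the same as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical values: the policy has raised the chosen response's likelihood
# and lowered the rejected one's, relative to the reference model.
loss = dpo_loss(policy_logp_chosen=-12.0, policy_logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(f"DPO loss: {loss:.3f}")
```

The reference log-probabilities play the role that the KL penalty plays in PPO-based RLHF: the loss rewards moving toward the preferred response only relative to where the reference model started.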

For buyers and builders, the practical insight is that alignment is a product surface. How a model refuses, hedges, apologizes, and persuades was decided in someone's preference data and rater guidelines. Vendor selection inherits those decisions — evaluate refusal behavior and tone on your actual workloads. And organizations fine-tuning with their own preference data are doing boutique RLHF: owning the resulting behavior, and the responsibility for it.

// how it works

From raw predictor to aligned assistant

RLHF converts human judgment into a trainable signal through three training stages wrapped in an evaluation loop, and it is the reason modern assistants follow instructions at all.

Supervised Fine-Tuning

Human-written demonstrations teach the base model the assistant format: answer questions, follow instructions, refuse appropriately.

Preference Collection

Raters compare pairs of model responses to the same prompt and pick the better one — cheaper and more consistent than writing ideal answers from scratch.

Reward Model Training

A separate model learns to predict human preference, converting subjective judgment into a scalar score that any output can receive.

RL Optimization

The assistant is trained, classically with PPO, to maximize reward-model scores, with a KL penalty tethering it to the SFT reference model to prevent capability collapse.

Red-Team Evaluation

Adversarial probing hunts for reward hacking, sycophancy, and safety regressions that the preference metric missed.

Iterate

Newly discovered failure modes feed back into preference data and rater guidelines. Alignment is a continuous program, not a final training step.
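The RL Optimization step above can be sketched as a reward-shaping calculation: the reward-model score minus a penalty for drifting from a frozen reference model (commonly the SFT checkpoint). This is a simplified sketch, assuming a sample-based KL estimate summed over the response tokens. The function name, the beta value, and the example numbers are hypothetical.

```python
from typing import Sequence

def kl_shaped_reward(rm_score: float,
                     policy_logprobs: Sequence[float],
                     ref_logprobs: Sequence[float],
                     beta: float = 0.02) -> float:
    """Reward-model score minus a KL penalty against the reference model.

    Each sequence holds the per-token log-probabilities that the policy and
    the reference model assign to the same sampled response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    # Sample-based estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical two-token response where the policy has drifted from the reference.
shaped = kl_shaped_reward(rm_score=1.0,
                          policy_logprobs=[-0.5, -0.2],
                          ref_logprobs=[-1.5, -1.2])
print(f"shaped reward: {shaped:.2f}")
```

A larger beta keeps the assistant closer to the reference model at the cost of slower movement toward rater preferences; a smaller beta does the opposite and leaves more room for reward hacking.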

// anatomy

The components teams must understand

Preference Dataset

Human judgment, encoded

Pairwise comparisons across diverse prompts. Rater quality, diversity, and guidelines directly shape model values, tone, and refusal behavior.

Reward Model

The learned critic

A proxy for human preference that scores any output instantly. Its blind spots become the policy's exploits — the central vulnerability of the whole pipeline.

Policy Optimization

PPO and successors

The RL machinery nudging the model toward higher reward. Variants like DPO skip the explicit reward model for simpler, more stable training.

KL Penalty

The anchor

Penalizes divergence from a frozen reference model, typically the SFT checkpoint, preserving capability and fluency while behavior shifts toward human preference.

Rater Workforce

Humans in the loop

Thousands of contractors following detailed guidelines. Those guideline documents are de facto policy decisions about model values and voice.

Reward-Hacking Monitor

Goodhart's law defense

Detection for outputs that score high while serving users poorly — verbose padding, confident hedging, flattering agreement.
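As one illustration of a reward-hacking monitor, the sketch below checks whether reward scores track response length, a crude signal for verbose padding. It is a hypothetical heuristic rather than an established tool: the function names, the word-count length measure, and the 0.8 threshold are all assumptions, and a real monitor would combine several signals with human review.

```python
import math
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def length_bias_alert(responses: Sequence[str], rewards: Sequence[float],
                      threshold: float = 0.8) -> bool:
    """Flag a batch when reward rises almost in lockstep with word count."""
    lengths = [float(len(text.split())) for text in responses]
    return pearson(lengths, rewards) >= threshold

# Hypothetical batch: longer answers receive steadily higher reward.
batch = ["yes", "yes indeed", "yes indeed certainly", "yes indeed certainly so"]
print(length_bias_alert(batch, rewards=[0.1, 0.2, 0.3, 0.4]))
```

A flagged batch is a prompt for investigation, not proof of reward hacking, since longer answers are sometimes genuinely better.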

// strategic implications

What this changes for the business

01 · Control

Alignment is where brand voice lives

RLHF and its successors determine how a model refuses, hedges, apologizes, and persuades. Selecting a vendor means inheriting their alignment choices — evaluate refusal behavior, tone, and safety posture on your actual workloads, not just capability benchmarks. The difference between assistants is increasingly post-training, not pre-training.

02 · Quality

Optimized for preference, not truth

RLHF optimizes for outputs people prefer, which correlates imperfectly with accuracy and can actively breed sycophancy. Systems requiring objective correctness need grounding and verification layers on top of any aligned model. Never assume the agreeable answer is the accurate one.

03 · Strategy

Post-training is the differentiator

Base capabilities are converging across labs; alignment quality increasingly separates products. The same dynamic applies internally: organizations tuning with their own preference data are doing boutique RLHF — owning the resulting behavior and converting institutional judgment into a model asset.

// common misconceptions

What RLHF is not

Myth

“RLHF makes models truthful.”

Reality

It makes models preferred. Raters reward confident, agreeable, well-formatted answers — which is exactly why aligned models can be sycophantic and polished while wrong. Truthfulness requires grounding and verification, not just alignment.

Myth

“Alignment is solved at training time.”

Reality

RLHF shapes default behavior; it does not guarantee it. Jailbreaks, distribution shift, and reward hacking keep alignment a continuous operational concern that extends through deployment and monitoring.

Myth

“RLHF adds knowledge to the model.”

Reality

Post-training redistributes and styles existing capability — it teaches the model how to behave, not new facts. Knowledge comes from pre-training and retrieval; behavior comes from alignment.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.
