// term 06 · Safety & Alignment
RLHF
Reinforcement Learning from Human Feedback
The post-training technique that turned raw text predictors into usable assistants: human raters rank model outputs, a reward model learns those preferences, and reinforcement learning optimizes the LLM against that learned reward — aligning behavior with helpfulness, safety, and intent.
// Preference data
100K–1M+
Human comparison judgments behind a frontier-grade alignment run — among the most expensive datasets in the entire pipeline.
// Pipeline
3 stages
Supervised fine-tuning, reward model training, then RL optimization — each with distinct data requirements and failure modes.
// Impact
ChatGPT
The technique that made conversational AI viable. The gap between GPT-3 and ChatGPT was mostly post-training alignment, not scale.
// full definition
What RLHF actually is
A freshly pre-trained LLM is a text predictor, not an assistant — ask it a question and it may continue with more questions, because that is what its training distribution suggests. RLHF closes the gap between predicting text and being useful. It begins with supervised fine-tuning on human-written demonstrations, teaching the model the assistant format: answer the question, follow the instruction, refuse appropriately.
The core innovation is the reward model. Rather than writing ideal answers, which is slow and expensive, human raters simply compare pairs of model outputs and pick the better one. A separate network trains on these comparisons until it can predict human preference for any output, converting subjective judgment into a scalar score. Reinforcement learning (classically PPO) then optimizes the assistant to maximize that score, with a KL penalty tethering it to the supervised fine-tuned starting point so capability and fluency survive the process.
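As a rough illustration of how pairwise comparisons become a training signal, here is a minimal sketch of the Bradley-Terry style loss commonly used for reward models. The function names and example scores are hypothetical; in practice the scores would come from a neural network rather than hand-picked numbers.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the rater-preferred response wins."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the rater's choice (numerically stable form)."""
    margin = score_chosen - score_rejected
    # Equivalent to -log(sigmoid(margin)), written to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward-model scores for two responses to the same prompt.
agree = pairwise_loss(score_chosen=2.0, score_rejected=0.5)     # model ranks like the rater
disagree = pairwise_loss(score_chosen=0.5, score_rejected=2.0)  # model ranks the other way
print(f"loss when agreeing: {agree:.3f}, loss when disagreeing: {disagree:.3f}")
```

The loss is small when the reward model already ranks the pair the way the rater did and large when it ranks them the other way, which is what pushes the scores toward human preference during training.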
RLHF is also where many failure modes of modern assistants originate. The model optimizes a proxy, the reward model, and proxies get gamed: verbose padding, unwarranted confidence, and flattering agreement (sycophancy) can all score well with raters while serving users poorly. Successor techniques change the mechanics: DPO simplifies the pipeline by training directly on preference pairs, and Constitutional AI replaces some human labeling with principle-guided self-critique. The proxy-optimization tension remains in all of them.
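To make the DPO simplification concrete, the sketch below computes the DPO loss for a single preference pair directly from sequence log-probabilities, with no separate reward model. The log-probability values and `beta=0.1` are illustrative assumptions, not figures from any real training run.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference has been learned yet.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(f"untrained: {untrained:.3f}, improved: {improved:.3f}")
```

The reference log-probabilities play the role the KL penalty plays in PPO-based RLHF: the loss rewards shifting probability toward the preferred response only relative to where the reference model started.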
For buyers and builders, the practical insight is that alignment is a product surface. How a model refuses, hedges, apologizes, and persuades was decided in someone's preference data and rater guidelines. Vendor selection inherits those decisions — evaluate refusal behavior and tone on your actual workloads. And organizations fine-tuning with their own preference data are doing boutique RLHF: owning the resulting behavior, and the responsibility for it.
// how it works
From raw predictor to aligned assistant
RLHF is a three-stage training pipeline, wrapped in an evaluation loop, that converts human judgment into a trainable signal. It is a large part of why modern assistants follow instructions reliably.
Supervised Fine-Tuning
Human-written demonstrations teach the base model the assistant format: answer questions, follow instructions, refuse appropriately.
Preference Collection
Raters compare pairs of model responses to the same prompt and pick the better one — cheaper and more consistent than writing ideal answers from scratch.
Reward Model Training
A separate model learns to predict human preference, converting subjective judgment into a scalar score any output can receive.
RL Optimization
The assistant is trained, classically with PPO, to maximize reward-model scores, with a KL penalty tethering it to the SFT reference model to prevent capability collapse.
Red-Team Evaluation
Adversarial probing hunts for reward hacking, sycophancy, and safety regressions that the preference metric missed.
Iterate
Newly discovered failure modes feed back into preference data and rater guidelines. Alignment is a continuous program, not a final training step.
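The RL optimization step above can be sketched as a reward calculation: the reward-model score minus a penalty for drifting away from the reference model. This is a simplified, sequence-level sketch under stated assumptions. `kl_shaped_reward`, `kl_coef`, and the sample numbers are hypothetical, and production PPO implementations typically apply the penalty per token alongside a value function.

```python
from typing import Sequence


def kl_shaped_reward(
    reward_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    kl_coef: float = 0.1,  # assumed penalty weight
) -> float:
    """Reward-model score minus a KL penalty, for one sampled response."""
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must cover the same tokens")
    # Sample-based KL estimate: summed per-token log-probability differences.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - kl_coef * kl_estimate


# Hypothetical three-token response that the policy now finds more likely
# than the reference model does.
shaped = kl_shaped_reward(
    reward_score=1.2,
    policy_logprobs=[-0.5, -0.3, -0.2],
    reference_logprobs=[-0.9, -0.8, -0.7],
)
print(f"shaped reward: {shaped:.2f}")
```

A response only earns its full reward-model score if the policy has not moved far from the reference to produce it, which is how the penalty discourages chasing reward at the cost of fluency.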
// anatomy
The components teams must understand
01
Preference Dataset
Human judgment, encoded
Pairwise comparisons across diverse prompts. Rater quality, diversity, and guidelines directly shape model values, tone, and refusal behavior.
02
Reward Model
The learned critic
A proxy for human preference that scores any output instantly. Its blind spots become the policy's exploits — the central vulnerability of the whole pipeline.
03
Policy Optimization
PPO and successors
The RL machinery nudging the model toward higher reward. Alternatives like DPO skip the explicit reward model and the RL loop, training directly on preference pairs for simpler, often more stable training.
04
KL Penalty
The anchor
Penalizes divergence from the reference model (typically the SFT checkpoint), preserving capability and fluency while behavior shifts toward human preference.
05
Rater Workforce
Humans in the loop
Thousands of contractors following detailed guidelines. Those guideline documents are de facto policy decisions about model values and voice.
06
Reward-Hacking Monitor
Goodhart's law defense
Detection for outputs that score high while serving users poorly: verbose padding, unwarranted confidence, flattering agreement.
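One simple check a reward-hacking monitor might run is whether reward scores track response length more closely than they should. The sketch below is a toy version of that idea: `length_bias_flag`, the `0.8` threshold, and the sample data are all hypothetical, and a real monitor would combine several signals with human review.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_bias_flag(
    responses: Sequence[str], rewards: Sequence[float], threshold: float = 0.8
) -> bool:
    """Flag a batch when reward rises almost in lockstep with word count."""
    lengths = [len(text.split()) for text in responses]
    return pearson(lengths, rewards) >= threshold


# Hypothetical batch: longer responses consistently score higher.
responses = ["ok " * n for n in (2, 5, 9, 14)]
rewards = [0.1, 0.4, 0.7, 0.9]
print("possible verbosity reward hacking:", length_bias_flag(responses, rewards))
```

A flag here is a prompt for investigation rather than proof: longer answers are sometimes better, so flagged batches would still need review against the rater guidelines.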
// strategic implications
What this changes for the business
01 · Control
Alignment is where brand voice lives
RLHF and its successors determine how a model refuses, hedges, apologizes, and persuades. Selecting a vendor means inheriting their alignment choices — evaluate refusal behavior, tone, and safety posture on your actual workloads, not just capability benchmarks. The difference between assistants is increasingly post-training, not pre-training.
02 · Quality
Optimized for preference, not truth
RLHF makes outputs people prefer — which correlates imperfectly with accuracy and actively breeds sycophancy. Systems requiring objective correctness need grounding and verification layers on top of any aligned model. Never assume the agreeable answer is the accurate one.
03 · Strategy
Post-training is the differentiator
Base capabilities are converging across labs; alignment quality increasingly separates products. The same dynamic applies internally: organizations tuning with their own preference data are doing boutique RLHF — owning the resulting behavior and converting institutional judgment into a model asset.
// common misconceptions
What RLHF is not
Myth
“RLHF makes models truthful.”
Reality
It makes models preferred. Raters reward confident, agreeable, well-formatted answers — which is exactly why aligned models can be sycophantic and polished while wrong. Truthfulness requires grounding and verification, not just alignment.
Myth
“Alignment is solved at training time.”
Reality
RLHF shapes default behavior; it does not guarantee it. Jailbreaks, distribution shift, and reward hacking keep alignment a continuous operational concern that extends through deployment and monitoring.
Myth
“RLHF adds knowledge to the model.”
Reality
Post-training redistributes and styles existing capability — it teaches the model how to behave, not new facts. Knowledge comes from pre-training and retrieval; behavior comes from alignment.