# RLHF — Reinforcement Learning from Human Feedback

> The post-training technique that turned raw text predictors into usable assistants: human raters rank model outputs, a reward model learns those preferences, and reinforcement learning optimizes the LLM against that learned reward — aligning behavior with helpfulness, safety, and intent.

**Canonical URL:** https://www.andekian.com/ai-lexicon/rlhf  
**Author / Site:** Stephen Andekian — https://www.andekian.com

**Term 06 of 100** · Safety & Alignment  
**Tags:** Alignment, Reward Model, Preferences, Post-Training

## Key Stats

- **Preference data — 100K–1M+:** Human comparison judgments behind a frontier-grade alignment run — among the most expensive datasets in the entire pipeline.
- **Pipeline — 3 stages:** Supervised fine-tuning, reward model training, then RL optimization — each with distinct data requirements and failure modes.
- **Impact — ChatGPT:** The technique that made conversational AI viable. The gap between GPT-3 and ChatGPT was alignment, not scale.

## What RLHF Actually Is

A freshly pre-trained LLM is a text predictor, not an assistant — ask it a question and it may continue with more questions, because that is what its training distribution suggests. RLHF closes the gap between predicting text and being useful. It begins with supervised fine-tuning on human-written demonstrations, teaching the model the assistant format: answer the question, follow the instruction, refuse appropriately.

The core innovation is the reward model. Writing ideal answers is slow and expensive, so human raters instead compare pairs of model outputs and pick the better one. A separate network trains on these comparisons until it can predict human preference for any output, converting subjective judgment into a scalar score. Because that score is assigned to sampled text rather than computed by a function the assistant can backpropagate through, reinforcement learning (classically PPO) is used to optimize the assistant to maximize it, with a KL penalty tethering the assistant to its supervised fine-tuned starting point so capability and fluency survive the process.
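A common way to turn pairwise comparisons into a training signal is a Bradley-Terry style loss: the reward model is penalized whenever it scores the rejected response close to or above the chosen one. The sketch below shows only that loss on two already-computed scores; the function name and the plain-float inputs are illustrative assumptions, and a real reward model would produce these scores from a neural network and average the loss over many pairs.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one comparison.

    Equals -log(sigmoid(score_chosen - score_rejected)).
    Hypothetical helper for illustration, not a library API.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Reward model agrees with the rater: small loss.
    print(pairwise_preference_loss(2.0, 0.5))
    # Reward model disagrees with the rater: larger loss.
    print(pairwise_preference_loss(0.5, 2.0))
```

When the two scores tie, the loss is log 2 (about 0.693), which corresponds to the reward model treating the comparison as a coin flip.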

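RLHF is also where many characteristic failure modes of modern assistants originate. The model optimizes a proxy (the reward model), and proxies get gamed: verbose padding, confident hedging, and flattering agreement (sycophancy) all score well with raters, and with the reward model trained on their judgments, while serving users poorly. Successor techniques change the mechanics: DPO simplifies the pipeline by training directly on preference pairs without a separate reward model, and Constitutional AI replaces some human labeling with principle-guided self-critique. The proxy-optimization tension remains in each of them, because the training target is still an approximation of what users actually need.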
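To make the DPO simplification concrete, here is a minimal sketch of the per-pair DPO loss. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed; the function name, argument names, and the `beta=0.1` default are illustrative assumptions rather than values from any particular library.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the implicit KL constraint
) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the total log-probability of a full response.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(logit)).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

No reward model appears anywhere in the computation: the preference pair and the reference log-probabilities play that role, which is the simplification the technique is known for.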

For buyers and builders, the practical insight is that alignment is a product surface. How a model refuses, hedges, apologizes, and persuades was decided in someone's preference data and rater guidelines. Vendor selection inherits those decisions — evaluate refusal behavior and tone on your actual workloads. And organizations fine-tuning with their own preference data are doing boutique RLHF: owning the resulting behavior, and the responsibility for it.

## How It Works: From raw predictor to aligned assistant

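RLHF converts human judgment into a trainable signal, and it is the reason modern assistants follow instructions at all. Its three training stages (steps 1, 3, and 4 below) sit inside a wider loop of preference collection, evaluation, and iteration.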

1. **Supervised Fine-Tuning** — Human-written demonstrations teach the base model the assistant format: answer questions, follow instructions, refuse appropriately.
2. **Preference Collection** — Raters compare pairs of model responses to the same prompt and pick the better one — cheaper and more consistent than writing ideal answers from scratch.
3. **Reward Model Training** — A separate model learns to predict human preference, converting subjective judgment into a scalar score any output can receive.
4. **RL Optimization** — The assistant is trained (classically with PPO) to maximize reward-model scores, with a KL penalty tethering it to the supervised fine-tuned model from step 1 to prevent capability collapse.
5. **Red-Team Evaluation** — Adversarial probing hunts for reward hacking, sycophancy, and safety regressions that the preference metric missed.
6. **Iterate** — Newly discovered failure modes feed back into preference data and rater guidelines. Alignment is a continuous program, not a final training step.
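The RL optimization step can be sketched as a reward that is discounted by how far the assistant has drifted from a frozen reference copy of itself (in practice usually the supervised fine-tuned checkpoint). The example below is a simplified, sequence-level illustration: the function name, the per-token log-probability lists, and the `beta=0.1` coefficient are assumptions made for this sketch, and production PPO implementations typically apply the penalty per token alongside a learned value function.

```python
from typing import Sequence


def kl_shaped_reward(
    reward_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # assumed penalty coefficient
) -> float:
    """Reward-model score minus a KL-style penalty (hypothetical helper).

    The log-probability lists hold one entry per generated token, taken
    from the policy being trained and from the frozen reference model.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Single-sample estimate of the divergence on this response:
    # log pi(y|x) - log pi_ref(y|x).
    log_ratio = sum(policy_logprobs) - sum(reference_logprobs)
    return reward_score - beta * log_ratio


if __name__ == "__main__":
    # No drift from the reference: the reward passes through unchanged.
    print(kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
    # The policy is now more confident in this text than the reference was,
    # so part of the reward is taken back.
    print(kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

A larger `beta` keeps the assistant closer to the reference at the cost of slower behavioral change; a smaller one allows more change and more room for reward hacking.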

## Anatomy: The Components Teams Must Understand

- **Preference Dataset** (Human judgment, encoded): Pairwise comparisons across diverse prompts. Rater quality, diversity, and guidelines directly shape model values, tone, and refusal behavior.
- **Reward Model** (The learned critic): A proxy for human preference that scores any output instantly. Its blind spots become the policy's exploits — the central vulnerability of the whole pipeline.
- **Policy Optimization** (PPO and successors): The RL machinery nudging the model toward higher reward. Alternatives such as DPO skip the explicit reward model and the RL loop, training directly on preference pairs for a simpler, often more stable process.
- **KL Penalty** (The anchor): Penalizes divergence from the reference model (typically the supervised fine-tuned checkpoint), preserving capability and fluency while behavior shifts toward human preference.
- **Rater Workforce** (Humans in the loop): Often large teams of contractors following detailed guidelines. Those guideline documents are de facto policy decisions about model values and voice.
- **Reward-Hacking Monitor** (Goodhart's law defense): Detection for outputs that score high while serving users poorly — verbose padding, confident hedging, flattering agreement.
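One simple signal a reward-hacking monitor might track is whether reward scores rise with response length regardless of content. The sketch below is a deliberately crude illustration of that idea: the function names, the word-count length measure, and the `0.8` threshold are all assumptions, and a high correlation is only a prompt for human review, since longer answers are sometimes legitimately better.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def flag_length_bias(
    responses: Sequence[str],
    rewards: Sequence[float],
    threshold: float = 0.8,  # assumed review threshold
) -> bool:
    """True when reward tracks word count closely enough to warrant review."""
    lengths = [float(len(text.split())) for text in responses]
    return pearson(lengths, rewards) >= threshold


if __name__ == "__main__":
    sample = ["yes", "yes it is", "yes it is indeed so", "yes it is indeed so for sure"]
    print(flag_length_bias(sample, [0.1, 0.3, 0.5, 0.7]))
```

The same pattern extends to other suspected exploits, for example correlating reward with the frequency of agreement phrases to look for sycophancy.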

## Strategic Implications

- **Alignment is where brand voice lives** (01 · Control): RLHF and its successors determine how a model refuses, hedges, apologizes, and persuades. Selecting a vendor means inheriting their alignment choices — evaluate refusal behavior, tone, and safety posture on your actual workloads, not just capability benchmarks. The difference between assistants is increasingly post-training, not pre-training.
- **Optimized for preference, not truth** (02 · Quality): RLHF makes outputs people prefer — which correlates imperfectly with accuracy and actively breeds sycophancy. Systems requiring objective correctness need grounding and verification layers on top of any aligned model. Never assume the agreeable answer is the accurate one.
- **Post-training is the differentiator** (03 · Strategy): Base capabilities are converging across labs; alignment quality increasingly separates products. The same dynamic applies internally: organizations tuning with their own preference data are doing boutique RLHF — owning the resulting behavior and converting institutional judgment into a model asset.

## Common Misconceptions

- **Myth:** “RLHF makes models truthful.”  
  **Reality:** It makes models preferred. Raters reward confident, agreeable, well-formatted answers — which is exactly why aligned models can be sycophantic and polished while wrong. Truthfulness requires grounding and verification, not just alignment.
- **Myth:** “Alignment is solved at training time.”  
  **Reality:** RLHF shapes default behavior; it does not guarantee it. Jailbreaks, distribution shift, and reward hacking keep alignment a continuous operational concern that extends through deployment and monitoring.
- **Myth:** “RLHF adds knowledge to the model.”  
  **Reality:** Post-training redistributes and styles existing capability — it teaches the model how to behave, not new facts. Knowledge comes from pre-training and retrieval; behavior comes from alignment.

## Related Terms

- [LLM — Large Language Model](https://www.andekian.com/ai-lexicon/llm)
- [Pretraining — Large-Scale Model Learning](https://www.andekian.com/ai-lexicon/pretraining)
- [Instruction Tuning — Human-Guided Refinement](https://www.andekian.com/ai-lexicon/instruction-tuning)
- [Alignment — Human-Value Matching](https://www.andekian.com/ai-lexicon/alignment)
- [AI Safety — Risk Mitigation Systems](https://www.andekian.com/ai-lexicon/ai-safety)
- [Guardrails — Behavioral Constraints](https://www.andekian.com/ai-lexicon/guardrails)
- [Reinforcement Learning — Reward-Based Training](https://www.andekian.com/ai-lexicon/reinforcement-learning)
- [Constitutional AI — Rule-Based Alignment](https://www.andekian.com/ai-lexicon/constitutional-ai)

## Explore the Full Lexicon

All 100 terms: https://www.andekian.com/ai-lexicon
