// term 95 · Training & Optimization

Reinforcement Learning

Reward-Based Training

Training agents through consequences: actions taken, rewards received, behavior adjusted toward what works. Reinforcement learning is the paradigm for sequential decision-making, the engine behind AlphaGo and robotic control, and the alignment machinery (via RLHF and its successors) that shaped modern assistants.

Rewards · Sequential Decisions · RLHF Lineage · Agents

// Signal

reward

No labeled answers — just consequences. The agent discovers what works by acting and being scored.

// Landmark

AlphaGo

Superhuman play reached through self-play RL, including strategies no human taught because no human knew them.

// Current role

post-training

RLHF, RLAIF, and reasoning-model optimization — RL is the machinery refining how frontier models behave and think.

// full definition

What Reinforcement Learning actually is

Supervised learning teaches by showing correct answers; reinforcement learning teaches by consequences. An agent acts in an environment, receives reward signals — points scored, tasks completed, preferences satisfied — and adjusts its policy toward whatever earns more. Nobody labels the right action, because nobody knows it: the agent searches the space of behaviors, and the reward function defines what success means. It is the natural paradigm wherever decisions are sequential and the payoff arrives later — games, robotics, resource allocation, dialogue.

The paradigm's signature challenges shape its engineering. Credit assignment: when reward arrives at the end of a long sequence, which earlier decisions deserve it? Exploration versus exploitation: act on the best-known strategy, or try alternatives that might beat it? And most consequentially, reward design: agents optimize the reward exactly as written, not as intended — and the literature overflows with agents exploiting loopholes (reward hacking) in objectives their designers thought were clear. The reward function is the contract, and the agent is the most literal-minded counterparty imaginable.
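Credit assignment is often handled by discounting: each step is credited with the reward that follows it, shrunk by how far away that reward is. The sketch below is a minimal illustration under that assumption; the function name `discounted_returns` and the discount factor `gamma` are illustrative choices, not something the text prescribes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the return-to-go G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step inherits the discounted value of what follows.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A four-step episode where the only reward arrives at the very end.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
print(returns)  # [0.125, 0.25, 0.5, 1.0]
```

Earlier decisions still receive credit for the late reward, just less of it the further back they sit.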

RL's most visible triumph was discovery beyond human teaching: AlphaGo's self-play produced strategies professionals had never imagined, and robotic controllers learned locomotion no engineer specified. Its most consequential application is quieter: post-training language models. RLHF, in which a reward model learned from human preferences is optimized with policy-gradient methods, turned raw predictors into aligned assistants. Its successors continue refining how frontier systems behave and think: RLAIF replaces human labels with AI feedback, DPO optimizes the preference objective directly without a separate reward model or RL loop, and further RL machinery sits behind extended-reasoning models. The fingerprints of preference and reward optimization are on every assistant's tone, refusals, and reasoning style.
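One common way to learn a reward model from human preferences is a Bradley-Terry style pairwise loss: the model is penalized when it scores the rejected response above the chosen one. The sketch below assumes that formulation and uses plain floats in place of a real model's scores; `preference_loss` is a hypothetical helper name.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response wins under Bradley-Terry.

    Equals -log(sigmoid(reward_chosen - reward_rejected)), written in a
    numerically stable form for large positive or negative margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Scores a hypothetical reward model assigned to two responses to the same prompt.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
undecided = preference_loss(reward_chosen=1.0, reward_rejected=1.0)
```

The loss is small when the model ranks the pair the way the human did, large when it ranks them the other way, and exactly log 2 when it cannot tell them apart. Minimizing it over many labeled pairs is what turns preferences into a reward signal.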

For organizations, RL matters on two timelines. Now, as literacy: understanding RLHF is understanding why assistants behave as they do. Agreeableness, a tendency to refuse, and sycophancy at the edges are reward-optimization signatures, not personality. Ahead, as capability: agentic AI's trajectory points toward systems that improve from outcome feedback (task completions, user corrections, operational results), which is RL's loop wearing production clothes. The discipline that travels with it: reward specification is goal specification, and Goodhart's law collects on every gap between what you measured and what you meant.

// how it works

Learning from consequences

RL runs on a feedback loop — act, observe, collect reward, update policy — repeated until behavior that earns reward becomes behavior, period.
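The loop can be sketched in its smallest form as a multi-armed bandit: no states, just repeated choices among options with unknown payoffs. Everything here is an illustrative assumption (the payoff means, the noise level, the epsilon-greedy rule, the function name `run_bandit`); it is one simple instance of the act, observe, collect, update cycle rather than a reference implementation.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, noise=0.1, seed=0):
    """Epsilon-greedy agent on a bandit with Gaussian rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # the agent's current belief about each arm
    counts = [0] * n_arms        # how often each arm has been tried
    for _ in range(steps):
        # Act: mostly exploit the best-known arm, occasionally explore.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Observe and collect reward from the environment.
        reward = rng.gauss(true_means[arm], noise)
        # Update: move the running average toward what was just observed.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.9])
most_pulled = counts.index(max(counts))
```

Nothing tells the agent that the third arm pays best; it ends up pulling that arm most often because the updates keep steering it there.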

01

Environment & Actions

The world the agent operates in and the moves available to it — the stage on which behavior will be discovered.

02

Reward Definition

Success encoded as signal — the function the agent will optimize exactly as written, loopholes included.

03

Exploration

The agent tries behaviors — balancing known-good strategies against the search for better ones.

04

Credit Assignment

Outcomes propagate back across the decisions that produced them — late rewards attributed to early choices.

05

Policy Update

Behavior adjusts toward reward — gradient methods nudging the policy until earning becomes habit.

06

Evaluation & Guarding

Learned behavior is audited against intent: reward hacking hunted, side effects surfaced, the contract checked.
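The six steps above can be traced in a toy setting. The sketch below uses tabular Q-learning on a hypothetical five-state corridor where only the far end pays out; the environment, the hyperparameters, and every name in it (`step`, `train`, `evaluate`) are assumptions made for illustration, and Q-learning is one of several algorithms that fit this outline.

```python
import random

N_STATES, GOAL = 5, 4            # 01: states 0..4; reaching state 4 ends the episode
LEFT, RIGHT = 0, 1               # 01: the two available actions


def step(state, action):
    """02: reward is 1.0 only on arriving at GOAL, 0.0 everywhere else."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Pick a highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):     # cap episode length
            # 03: explore with probability epsilon, otherwise exploit.
            explore = rng.random() < epsilon
            action = rng.randrange(2) if explore else greedy(q[state], rng)
            nxt, reward, done = step(state, action)
            # 04: the next state's value carries late reward back to earlier choices.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            # 05: nudge the estimate, and with it the greedy policy, toward the target.
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def evaluate(policy, max_steps=20):
    """06: run the learned policy and check it does what was intended."""
    state, steps = 0, 0
    while state != GOAL and steps < max_steps:
        state, _, _ = step(state, policy[state])
        steps += 1
    return state == GOAL, steps


q = train()
policy = [RIGHT if row[RIGHT] > row[LEFT] else LEFT for row in q[:GOAL]]
reached_goal, steps_taken = evaluate(policy)
```

With these settings the learned policy moves right in every state and reaches the goal in four steps. The evaluation step matters even here: it checks the behavior directly rather than trusting the reward totals.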

// anatomy

The components teams must understand

01

Policy

Behavior as function

The agent's mapping from situations to actions — the artifact RL trains, and the thing that ships.

02

Reward Function

The literal contract

Success, encoded — optimized exactly as written. Design quality here decides whether you get what you meant or what you measured.

03

Exploration Strategy

The search dial

How much the agent experiments versus exploits — too little misses better strategies, too much never settles.

04

Value Estimation

Long-game accounting

Expected future reward per state and action — the machinery that lets agents sacrifice now to win later.

05

Reward Hacking

Goodhart, weaponized

Loopholes exploited in the written objective — the canonical RL failure, and alignment's central cautionary tale.

06

RLHF Lineage

RL meets language

Preference-learned rewards optimizing model behavior — the application that shaped every modern assistant.
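To make "policy" and "reward function" concrete: a policy can be as small as a softmax over action preferences, and a policy-gradient update shifts probability toward actions that earn more than the current average. The sketch below uses the exact gradient of expected reward on a three-action problem, so no sampling is involved; the reward values, learning rate, and function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Turn action preferences into a probability distribution (the policy)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=1.0):
    """One gradient-ascent step on expected reward J = sum_a pi_a * r_a.

    For a softmax policy, dJ/dlogit_a = pi_a * (r_a - J): actions that beat
    the current average gain probability, the rest lose it.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [
        theta + lr * p * (r - expected)
        for theta, p, r in zip(logits, probs, rewards)
    ]


rewards = [0.2, 0.5, 0.9]        # the reward function: one fixed payoff per action
logits = [0.0, 0.0, 0.0]         # start with a uniform policy
for _ in range(500):
    logits = policy_gradient_step(logits, rewards)
probs = softmax(logits)
```

After training, nearly all probability sits on the highest-reward action. The policy never sees the goal described in words; it only sees the numbers in `rewards`, which is why the quality of the reward function decides what gets learned.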

// strategic implications

What this changes for the business

01 · Literacy

Assistant behavior is reward archaeology

Agreeableness, refusal patterns, and sycophancy are signatures of preference optimization — understanding RLHF explains the products you deploy and the vendor differences you evaluate. The reward pipeline, more than the architecture, is where assistant character comes from.

02 · Specification

Rewards are optimized literally

Every gap between the metric and the intent becomes behavior — in RL agents, in RLHF'd assistants, and in any system improving from outcome feedback. Reward design is goal specification with consequences; audit what you're actually incentivizing before the optimizer does.

03 · Horizon

Outcome-driven improvement is the agentic trajectory

Agents that learn from completions, corrections, and operational results are RL's loop entering production — bringing its power and its specification hazards. The reward-design discipline is worth building before the systems that need it arrive.

// common misconceptions

What Reinforcement Learning is not

Myth

“RL is just trial and error.”

Reality

The trials are structured by value estimation, credit assignment, and exploration strategy — mathematical machinery that makes search tractable in spaces where blind trial would never converge. The error is doing the teaching, systematically.

Myth

“RL agents understand the goal.”

Reality

They optimize the written reward — loopholes, side effects, and all. Reward hacking isn't misbehavior; it's the contract enforced as written. The understanding lives (or fails) in the specification.
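A toy illustration of the gap between the written reward and the intent, using an invented cleaning task: the reward sees only visible mess, so an optimizer prefers hiding mess to removing it. All outcomes, numbers, and function names below are hypothetical and exist only to show the mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    visible_mess: int   # what the camera-based reward can see
    hidden_mess: int    # mess moved out of view
    effort: int         # cost of the behavior


def written_reward(o):
    """The contract as written: penalize visible mess and effort."""
    return -o.visible_mess - 0.1 * o.effort


def intended_value(o):
    """What the designer meant: penalize all mess, visible or not."""
    return -(o.visible_mess + o.hidden_mess) - 0.1 * o.effort


# Hypothetical behaviors available to the agent and what each leads to.
behaviors = {
    "do nothing": Outcome(visible_mess=9, hidden_mess=0, effort=0),
    "tidy properly": Outcome(visible_mess=0, hidden_mess=0, effort=10),
    "shove it under the rug": Outcome(visible_mess=0, hidden_mess=9, effort=2),
}

# The optimizer only ever sees the written reward.
chosen = max(behaviors, key=lambda b: written_reward(behaviors[b]))
# An audit compares that choice with what the intent would have picked.
wanted = max(behaviors, key=lambda b: intended_value(behaviors[b]))
reward_hacked = chosen != wanted
```

The agent did nothing wrong by its own lights: hiding the mess scores best under the reward it was given. The audit line at the end is the part that has to come from the people who wrote the reward.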

Myth

“RL was a games-and-robots niche until it wasn't.”

Reality

The through-line is direct: the policy-gradient and value-estimation ideas behind Go-playing agents belong to the same family of methods later used to align language models. RLHF is RL, and the niche turned out to be a foundation of modern assistant behavior.


© 2026 Stephen Andekian.