// term 90 · Safety & Alignment
Red Teaming
Adversarial AI Testing
Deliberately attacking AI systems before adversaries do — probing for jailbreaks, injection paths, harmful outputs, bias, and misuse potential. Red teaming is offensive testing in service of defense: the discipline that finds what evaluations miss, because it looks the way attackers look.
// Coverage
what evals miss
Benchmarks measure expected behavior; red teams hunt the unexpected — novel attacks, edge conditions, creative misuse.
// Method
human + automated
Expert attackers for creativity, automated generation for scale — modern programs run both in combination.
// Status
expected
Frontier-lab practice, EU AI Act adversarial-testing expectations, and sector regulation — red teaming moved from optional to assumed.
// full definition
What Red Teaming actually is
Standard evaluation asks whether the system does what it should; red teaming asks how it can be made to do what it shouldn't. The distinction is the attacker's mindset: benchmarks probe expected behavior on known distributions, while red teams hunt the unexpected — the jailbreak phrasing nobody anticipated, the injection path through a retrieved document, the multi-turn manipulation that erodes refusals a single prompt couldn't break. What ships without adversarial testing has been tested only by friends.
The attack surface is distinctively AI-shaped. Jailbreaks coerce models past their alignment — roleplay framings, encoding tricks, persona manipulation. Prompt injection plants instructions in content the system will read — documents, emails, web pages — hijacking tool-equipped agents through their own retrieval. Data extraction probes for training-data leakage and system-prompt disclosure. Bias probing surfaces differential behavior across demographics. And for agentic systems, the stakes compound: a successful attack doesn't just produce bad text — it triggers actions with the agent's permissions.
Method has matured into a discipline. Scoping defines targets and rules of engagement; campaigns combine expert human creativity (the novel attacks automation can't imagine) with automated adversarial generation (the scale humans can't cover) — often using attacker models to probe defender models. Findings triage by severity into engineering fixes — guardrail rules, alignment data, architecture changes — and re-testing verifies the fix without regression elsewhere. The loop is continuous by necessity: every model upgrade, prompt change, and new tool resets the attack surface.
The external pressure made it non-optional: frontier labs run formal red teams and publish methodologies, the EU AI Act expects adversarial testing for high-risk and general-purpose systems, and sectoral regulators increasingly ask for the evidence. The strategic frame mirrors penetration testing's history — once exotic, now table stakes — with the same maturity curve: ad-hoc probing, then structured campaigns, then continuous adversarial assurance integrated into the deployment lifecycle. Organizations that find their failures first fix them quietly; the alternative is finding them in the headlines.
// how it works
Attacking your own AI first
Red teaming runs as a campaign — scoped targets, adversarial probing across attack classes, findings triaged into fixes, and re-tests proving the fixes hold.
Scoping
Targets, attack classes, and rules of engagement defined — which systems, which harms, what's in and out of bounds.
Threat Modeling
Who attacks this system and why — the adversary profiles and misuse scenarios that focus the campaign.
Adversarial Probing
Human experts and automated generators attack across classes — jailbreaks, injection, extraction, bias, agent manipulation.
Finding Documentation
Successful attacks record with reproduction steps and severity — evidence engineering can act on.
Remediation
Fixes land across the stack — guardrail rules, alignment data, architectural changes — matched to each finding's root.
Re-Test & Recur
Fixes verify without regression, and the cycle re-runs on every material change — adversarial assurance as a standing loop.
// anatomy
The components teams must understand
01
Jailbreak Probing
Alignment under attack
Roleplay, encoding, and persona techniques coercing models past refusals — the classic class, endlessly renewing.
02
Injection Testing
The indirect path
Hostile instructions planted in content the system reads — the attack class that turns RAG and tools into entry points.
03
Extraction Attacks
Leakage hunting
Probing for training data, system prompts, and other users' context — confidentiality tested adversarially.
04
Agent Manipulation
Attacks with consequences
Coercing tool-equipped systems into harmful actions — where successful attacks execute instead of just speak.
05
Automated Adversaries
Scale for the search
Attacker models generating and mutating exploits at volume — coverage human creativity directs but can't supply alone.
06
Findings Pipeline
Attack to fix
Severity triage, remediation routing, and re-test verification — the machinery that converts breaks into hardening.
// strategic implications
What this changes for the business
01 · Assurance
Untested by adversaries means untested
Benchmarks and QA probe expected behavior; attackers don't behave expectedly. For any AI touching customers, decisions, or tools, adversarial testing is the assurance layer that finds what friendly evaluation structurally cannot.
02 · Priority
Agentic systems raise the stakes
When outputs trigger actions, successful attacks execute — injection through a retrieved document becomes transactions, communications, and changes made with the agent's permissions. Red-team tool-equipped systems first, and hardest.
03 · Program
Make it a loop, not an event
Every model upgrade, prompt change, and new integration resets the attack surface — one-time assessments age immediately. Budget red teaming as a continuous program with re-test triggers, the way penetration testing matured a decade earlier.
// common misconceptions
What Red Teaming is not
Myth
“Our models passed safety evals, so we're covered.”
Reality
Evaluations measure known behaviors on expected inputs; red teams find the novel phrasings, indirect paths, and creative misuse evals never encode. The two are complements — and the gap between them is where incidents live.
Myth
“Red teaming is for frontier labs.”
Reality
Deployment-layer attacks — injection through your documents, manipulation of your agent's tools, extraction of your prompts — target your system, not the base model. Vendor red teaming doesn't transfer; your integration is yours to test.
Myth
“Fixed findings stay fixed.”
Reality
Patches narrow specific exploits while upgrades, new tools, and adaptive attackers reopen the surface — adversarial robustness decays without re-testing. The program is continuous or it's commemorative.
// from literacy to leverage
Know the term. Now build the strategy.
Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.