// term 90 · Safety & Alignment

Red Teaming

Adversarial AI Testing

Deliberately attacking AI systems before adversaries do — probing for jailbreaks, injection paths, harmful outputs, bias, and misuse potential. Red teaming is offensive testing in service of defense: the discipline that finds what evaluations miss, because it looks the way attackers look.

Adversarial TestingJailbreaksSecurityAssurance

// Coverage

what evals miss

Benchmarks measure expected behavior; red teams hunt the unexpected — novel attacks, edge conditions, creative misuse.

// Method

human + automated

Expert attackers for creativity, automated generation for scale — modern programs run both in combination.

// Status

expected

Frontier-lab practice, EU AI Act adversarial-testing expectations, and sector regulation — red teaming moved from optional to assumed.

// full definition

What Red Teaming actually is

Standard evaluation asks whether the system does what it should; red teaming asks how it can be made to do what it shouldn't. The distinction is the attacker's mindset: benchmarks probe expected behavior on known distributions, while red teams hunt the unexpected — the jailbreak phrasing nobody anticipated, the injection path through a retrieved document, the multi-turn manipulation that erodes refusals a single prompt couldn't break. What ships without adversarial testing has been tested only by friends.

The attack surface is distinctively AI-shaped. Jailbreaks coerce models past their alignment — roleplay framings, encoding tricks, persona manipulation. Prompt injection plants instructions in content the system will read — documents, emails, web pages — hijacking tool-equipped agents through their own retrieval. Data extraction probes for training-data leakage and system-prompt disclosure. Bias probing surfaces differential behavior across demographics. And for agentic systems, the stakes compound: a successful attack doesn't just produce bad text — it triggers actions with the agent's permissions.

Method has matured into a discipline. Scoping defines targets and rules of engagement; campaigns combine expert human creativity (the novel attacks automation can't imagine) with automated adversarial generation (the scale humans can't cover) — often using attacker models to probe defender models. Findings triage by severity into engineering fixes — guardrail rules, alignment data, architecture changes — and re-testing verifies the fix without regression elsewhere. The loop is continuous by necessity: every model upgrade, prompt change, and new tool resets the attack surface.

The external pressure made it non-optional: frontier labs run formal red teams and publish methodologies, the EU AI Act expects adversarial testing for high-risk and general-purpose systems, and sectoral regulators increasingly ask for the evidence. The strategic frame mirrors penetration testing's history — once exotic, now table stakes — with the same maturity curve: ad-hoc probing, then structured campaigns, then continuous adversarial assurance integrated into the deployment lifecycle. Organizations that find their failures first fix them quietly; the alternative is finding them in the headlines.

// how it works

Attacking your own AI first

Red teaming runs as a campaign — scoped targets, adversarial probing across attack classes, findings triaged into fixes, and re-tests proving the fixes hold.

Scoping

Targets, attack classes, and rules of engagement defined — which systems, which harms, what's in and out of bounds.

Threat Modeling

Who attacks this system and why — the adversary profiles and misuse scenarios that focus the campaign.

Adversarial Probing

Human experts and automated generators attack across classes — jailbreaks, injection, extraction, bias, agent manipulation.

Finding Documentation

Successful attacks record with reproduction steps and severity — evidence engineering can act on.

Remediation

Fixes land across the stack — guardrail rules, alignment data, architectural changes — matched to each finding's root.

Re-Test & Recur

Fixes verify without regression, and the cycle re-runs on every material change — adversarial assurance as a standing loop.

// anatomy

The components teams must understand

Jailbreak Probing

Alignment under attack

Roleplay, encoding, and persona techniques coercing models past refusals — the classic class, endlessly renewing.

Injection Testing

The indirect path

Hostile instructions planted in content the system reads — the attack class that turns RAG and tools into entry points.

Extraction Attacks

Leakage hunting

Probing for training data, system prompts, and other users' context — confidentiality tested adversarially.

Agent Manipulation

Attacks with consequences

Coercing tool-equipped systems into harmful actions — where successful attacks execute instead of just speak.

Automated Adversaries

Scale for the search

Attacker models generating and mutating exploits at volume — coverage human creativity directs but can't supply alone.

Findings Pipeline

Attack to fix

Severity triage, remediation routing, and re-test verification — the machinery that converts breaks into hardening.

// strategic implications

What this changes for the business

01 · Assurance

Untested by adversaries means untested

Benchmarks and QA probe expected behavior; attackers don't behave expectedly. For any AI touching customers, decisions, or tools, adversarial testing is the assurance layer that finds what friendly evaluation structurally cannot.

02 · Priority

Agentic systems raise the stakes

When outputs trigger actions, successful attacks execute — injection through a retrieved document becomes transactions, communications, and changes made with the agent's permissions. Red-team tool-equipped systems first, and hardest.

03 · Program

Make it a loop, not an event

Every model upgrade, prompt change, and new integration resets the attack surface — one-time assessments age immediately. Budget red teaming as a continuous program with re-test triggers, the way penetration testing matured a decade earlier.

// common misconceptions

What Red Teaming is not

Myth

“Our models passed safety evals, so we're covered.”

Reality

Evaluations measure known behaviors on expected inputs; red teams find the novel phrasings, indirect paths, and creative misuse evals never encode. The two are complements — and the gap between them is where incidents live.

Myth

“Red teaming is for frontier labs.”

Reality

Deployment-layer attacks — injection through your documents, manipulation of your agent's tools, extraction of your prompts — target your system, not the base model. Vendor red teaming doesn't transfer; your integration is yours to test.

Myth

“Fixed findings stay fixed.”

Reality

Patches narrow specific exploits while upgrades, new tools, and adaptive attackers reopen the surface — adversarial robustness decays without re-testing. The program is continuous or it's commemorative.

// from literacy to leverage

Know the term. Now build the strategy.

Vocabulary is the entry fee. Turning these primitives into pipeline, moats, and margin is the work. That's the conversation.

AI innovation, applied

Red Teaming

What Red Teaming actually is

Attacking your own AI first

The components teams must understand

What this changes for the business

What Red Teaming is not

Explore the wider architecture

Know the term. Now build the strategy.