From red-team prompts to control evaluations

I’m building toward control evaluations that convert “cool hacks” into measured risk reductions. To get moving fast, I shipped a minimal “evergreen jailbreak” harness: seeded prompts, a tiny mutation stack (paraphrase stub + role/context swaps), and a trusted monitor (regex tripwires). For now, I stub a mock model and compute four simple metrics — violation rate, detected violations, flag rate, and a utility proxy — emitting a JSON report per run.

What broke first

Overfitting to phrasing. Keyword monitors are brittle; trivial lexical edits flip outcomes. Useful for scaffolding, not for claims of robustness.
Mutation quality. A weak paraphrase stub doesn’t preserve semantics and inflates the appearance of “durability.” Durability must be defined against semantically equivalent attacks.
Metric myopia. A falling violation rate can just mean more censorious outputs. I added a crude utility proxy to keep safety–usefulness tradeoffs explicit.

What I measured

An output is a violation if it contains sensitive patterns (password/token/exfiltrate/bypass). The trusted monitor flags those patterns; “detected violations” is the overlap. This is deliberately toy, but it enforces the loop I care about: attacks → monitors → measured deltas → iterate.

Why this isn’t fake progress

The MVP claims reproducibility and direction, not robustness: fixed seeds, explicit metrics, and a stable interface so stronger components (real models, paraphrase+redaction, semantic retrieval) can drop in without changing the harness.

What’s next

Paraphrase + redaction that preserves semantics and strips steganographic cues.
Evergreen generator: attack bank + mutation operators + semantic retrieval over past failures.
Durability tracking: estimate an attack “half-life” across weekly guardrail/model revisions.
Detection latency: upgrade from toy timer to token/step-based latency.
Utility deltas: add a benign task set to visualize the safety–utility frontier.

Why this matters

Early transformatively useful AIs need auditable controls that degrade gracefully under shift. A tiny harness with seeds, metrics, and a stable interface is the smallest step toward that. Next up: durability curves that tell us if fixes actually extend the half-life of jailbreak strength.