About
Hi, I’m Guanni (Jenny) Qu, an EA-aligned builder and researcher focused on adversarial evaluation and AI control. I like turning red-team ideas into reproducible, auditable evaluations that measurably reduce risk in deployment.
Now
- AI control & evals: Building a seeded “evergreen jailbreak” harness and control evaluations (tripwires, trusted monitoring, restricted affordances) to push down P(catastrophe | scheming) with pre-registered metrics.
- Security mindset: DEF CON CTF + RF/hardware (Flipper Zero, Proxmark3) → attacker-style elicitation of jailbreaks, exfil paths, and deceptive behavior.
- Reproducible RL: Clean baselines and scaffolds (DQN → C51/QR-DQN) with seeds/configs/ablations; evaluation discipline over vibes.
Why (EA)
I’m motivated by Effective Altruism: work where the expected impact is high and legible. For near-term safety, I prioritize control evaluations that are black-box, capability-grounded, and iteratively improvable; for the long run, I care about research that makes alignment and governance decisions evaluable rather than speculative.
Research directions
- Control evaluations for agentic LLMs: Red-team → automated stress tests; “caught red-handed” as a win condition; few-shot catastrophe prevention (proliferation, interrogation, retro-monitoring).
- Rogue deployments: Split safety cases into no-rogue and no-catastrophe-under-controls; emphasize underrated rogue internal risks when weight/compute access is broad.
- Distributional RL: C51/QR-DQN reproductions and ablations (support bounds, atoms, n-step) to keep my evaluation practice sharp and honest.
- Foundations: QFT (Cohen–Luty + Steven Weinberg track), knot theory (Seifert fibered spaces), number theory (Jean-Pierre Serre, A Course in Arithmetic); math/physics grounding for long-horizon reasoning and rigorous writing.
Open question I’m investigating
What is the half-life of jailbreak attack strength under iterative guardrail/model updates, and can an automated, “evergreen” generator predict or extend it?
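One way to make "half-life" concrete is sketched below. It assumes attack success rate decays roughly exponentially across update rounds, which is a modeling assumption and not an established result. The function name `estimate_half_life` and the sample rates are hypothetical, for illustration only.

```python
import math


def estimate_half_life(success_rates):
    """Estimate the half-life (in update rounds) of attack success rate.

    Assumes ASR_t ~= ASR_0 * exp(-lam * t) and fits lam by ordinary least
    squares on log(ASR_t). Rounds with zero success are skipped, since
    log(0) is undefined. Returns math.inf if the rate is not decaying.
    """
    points = [(t, math.log(r)) for t, r in enumerate(success_rates) if r > 0]
    if len(points) < 2:
        raise ValueError("need at least two rounds with nonzero success rate")

    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx  # slope of log(ASR) per round, equal to -lam

    if slope >= 0:
        return math.inf  # flat or growing success rate: no finite half-life
    return math.log(2) / -slope


# Hypothetical attack success rates after each guardrail/model update.
rates = [0.80, 0.40, 0.20, 0.10]
half_life = estimate_half_life(rates)
```

With these made-up rates, success halves every round, so the fitted half-life comes out at about one update round. Real measurements would be noisier and might not follow an exponential at all, which is part of what the question is asking.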
Selected things I ship
- Evergreen Jailbreak Evals (MVP): seeded attacks → mutations → trusted monitoring → violation/detection metrics. Post.
- DRL repro scaffold: DQN baseline with stubs for C51/QR-DQN, seeds/configs/logging; ablation-ready.
- Survivor-tech MVP: trauma-informed, privacy-first reporting and evidence journaling; outreach to Title IX/survivor orgs.
- GuanniGolf (WeChat): bilingual golf slang explainer series; rapid content iteration and audience growth.
Contact
Email: [email protected]. I’m excited to collaborate on eval engineering, control protocols, and artifact-driven safety research.