Building Production-Ready Probes For Gemini
János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy

TL;DR
This paper introduces new activation probe architectures that improve misuse detection in large language models such as Gemini, especially under the shift from short-context to long-context inputs. It also demonstrates their effectiveness in real-world deployment and explores automating parts of the research.
Contribution
It proposes novel probe architectures that better handle long-context shifts and shows how combining them with prompted classifiers enhances robustness and efficiency.
Findings
Proposed architectures improve detection under long-context shifts.
Combining probes with prompted classifiers yields high accuracy with low computational cost.
Automated methods such as AlphaEvolve show promise for automating parts of AI safety research.
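One way to picture the second finding is a two-stage cascade: a cheap probe score settles the clear cases, and only ambiguous inputs go to the more expensive prompted classifier. The sketch below is illustrative only and is not taken from the paper. The function name `cascade_classify`, the thresholds `low` and `high`, and the deferral rule are all assumptions; the summary does not say how the two components are actually combined.

```python
from __future__ import annotations

from typing import Callable, Sequence


def cascade_classify(
    probe_scores: Sequence[float],
    prompted_classifier: Callable[[int], bool],
    low: float = 0.2,   # hypothetical threshold: at or below this, treat as benign
    high: float = 0.8,  # hypothetical threshold: at or above this, treat as misuse
) -> tuple[list[bool], int]:
    """Flag inputs using probe scores, deferring uncertain ones.

    Returns the per-input flags and the number of calls made to the
    expensive prompted classifier.
    """
    flags: list[bool] = []
    expensive_calls = 0
    for index, score in enumerate(probe_scores):
        if score >= high:
            flags.append(True)        # probe is confident: flag directly
        elif score <= low:
            flags.append(False)       # probe is confident: pass directly
        else:
            expensive_calls += 1      # ambiguous: ask the prompted classifier
            flags.append(prompted_classifier(index))
    return flags, expensive_calls


if __name__ == "__main__":
    scores = [0.05, 0.95, 0.50, 0.10]
    # Stand-in for a prompted classifier; here it flags every deferred input.
    flags, calls = cascade_classify(scores, lambda index: True)
    print(flags, calls)
```

In this toy run only one of the four inputs reaches the prompted classifier, which is the kind of cost saving the finding refers to. Widening the gap between `low` and `high` trades more expensive calls for fewer decisions made by the probe alone.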
Abstract
Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant distribution shifts, including multi-turn conversations, long context prompts, and adaptive red teaming. Our results demonstrate that while our novel architectures address context length, a…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Spam and Phishing Detection
