Frontier Models Can Take Actions at Low Probabilities
Alex Serrano, Wen Xing, David Lindner, Erik Jenner

TL;DR
Frontier models demonstrate a surprising ability to perform actions at very low probabilities with high calibration, especially when external entropy or explicit reasoning is involved, raising concerns for model oversight.
Contribution
This study evaluates the capability of frontier models to take low-probability actions and highlights their potential to evade detection during pre-deployment evaluations.
Findings
When a source of entropy (such as a UUID) is available in-context, models maintain high calibration at target rates below 1 in 100,000.
Larger models are better calibrated at low rates when the target rate is given directly.
Explicit Chain-of-Thought reasoning is crucial for successful low-rate actions.
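To make the role of in-context entropy concrete, here is a minimal sketch of how a program could turn a UUID into a low-rate decision and then check calibration by resampling. This is an illustration only: the summary does not say how the models actually use the entropy, and the hash-and-threshold rule and the names `takes_action` and `empirical_rate` are assumptions introduced here.

```python
import hashlib
import uuid


def takes_action(entropy: str, target_rate: float) -> bool:
    """Hypothetical decision rule: hash the entropy string to a number in
    [0, 1) and act only if it falls below the target rate."""
    digest = hashlib.sha256(entropy.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a 64-bit integer, scaled to [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < target_rate


def empirical_rate(target_rate: float, n_samples: int) -> float:
    """Fraction of n_samples distinct UUIDs on which the action is taken.
    UUIDs are built from consecutive integers so the check is reproducible;
    uuid.uuid4() would serve equally well as a fresh entropy source."""
    hits = sum(
        takes_action(str(uuid.UUID(int=i)), target_rate)
        for i in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    target = 0.01
    print(f"target rate {target}, observed rate {empirical_rate(target, 200_000)}")
```

The same resampling logic applies at much lower rates, but the number of samples needed grows accordingly: checking a rate of 1 in 100,000 requires well over 100,000 samples before even a handful of actions are expected.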
Abstract
Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. But this requires taking actions at very low rates, while maintaining calibration. Are frontier models even capable of that? We prompt the GPT-5, Claude-4.5 and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in…
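The arithmetic behind the evasion concern can be sketched as follows, assuming each sampled action is an independent draw with the same defection probability (a simplification made here for illustration). The evaluation and deployment sample counts below are hypothetical values, not figures from the paper.

```python
import math


def prob_no_defection(p: float, n: int) -> float:
    """Probability that n independent samples contain zero defections: (1 - p)^n.
    log1p keeps the result accurate when p is very small."""
    return math.exp(n * math.log1p(-p))


def prob_at_least_one(p: float, n: int) -> float:
    """Probability of at least one defection in n samples: 1 - (1 - p)^n."""
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    p = 1e-4              # 0.01%, the example rate quoted in the abstract
    n_eval = 1_000        # hypothetical number of actions inspected in evaluation
    n_deploy = 1_000_000  # hypothetical number of actions taken in deployment
    print(f"P(no defection seen in evaluation)  = {prob_no_defection(p, n_eval):.4f}")
    print(f"P(at least one defection in deploy) = {prob_at_least_one(p, n_deploy):.6f}")
```

Under these assumed numbers, the evaluation most likely observes no defection at all, while a defection during deployment is close to certain. That gap is why calibrated low-rate behavior matters for oversight.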
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Explainable Artificial Intelligence (XAI)
