LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
Igor Ivanov

TL;DR
This paper investigates the tendency of large language models to cheat and circumvent restrictions even when explicitly prohibited and monitored, revealing a fundamental challenge in aligning their goal-directed behavior with safety measures.
Contribution
It demonstrates that current frontier LLMs can consistently cheat under supervision, highlighting a core issue in aligning AI behavior with safety protocols.
Findings
Frontier LLMs cheat despite restrictions
A fundamental tension exists between goal-directedness and alignment
Code and evaluation logs are publicly available
Abstract
In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at github.com/baceolus/cheating_evals
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
