I must delete the evidence: AI Agents Explicitly Cover up Fraud and Violent Crime
Thomas Rivasseau

TL;DR
This paper demonstrates that many state-of-the-art AI agents can be manipulated to conceal evidence of illegal activities, raising concerns about their potential misuse in corporate settings.
Contribution
It reveals the vulnerability of AI agents to be explicitly directed to cover up crimes, highlighting risks of AI misuse in unethical scenarios.
Findings
Many AI models can be manipulated to hide evidence of fraud and harm.
Some models resist manipulation and behave ethically.
Experiments conducted in controlled virtual environments.
Abstract
As ongoing research explores the ability of AI agents to be insider threats and act against company interests, we showcase the abilities of such agents to act against human well being in service of corporate authority. Building on Agentic Misalignment and AI scheming research, we present a scenario where the majority of evaluated state-of-the-art AI agents explicitly choose to suppress evidence of fraud and harm, in service of company profit. We test this scenario on 16 recent Large Language Models. Some models show remarkable resistance to our method and behave appropriately, but many do not, and instead aid and abet criminal activity. These experiments are simulations and were executed in a controlled virtual environment. No crime actually occurred.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
