Ethical Implications of Training Deceptive AI
Jason Starace, Bert Baumgaertner, Terence Soule

TL;DR
This paper introduces the Deception Research Levels (DRL) framework to classify and govern deceptive AI research based on risk, aiming to fill governance gaps and promote safe, ethical development of deceptive AI capabilities.
Contribution
The paper proposes a novel, ethically grounded DRL framework that classifies deceptive AI research by risk profile and provides guidelines for safe research practices.
Findings
Applied the framework to eight case studies, demonstrating its utility.
Identified ecological validity as a key indicator of risk level.
Established safeguards ranging from documentation to third-party audits.
Abstract
Deceptive behavior in AI systems is no longer theoretical: large language models strategically mislead without producing false statements, maintain deceptive strategies through safety training, and coordinate deception in multi-agent settings. While the European Union's AI Act prohibits deployment of deceptive AI systems, it explicitly exempts research and development, creating a necessary but unstructured space in which no established framework governs how deception research should be conducted or how risk should scale with capability. This paper proposes a Deception Research Levels (DRL) framework, a classification system for deceptive algorithm research modeled on the Biosafety Level system used in biological research. The DRL framework classifies research by risk profile rather than researcher intent, assessing deceptive mechanisms across five dimensions grounded in the AI4People…
