Experiments with Detecting and Mitigating AI Deception
Ismail Sahbane, Francis Rhys Ward, C Henrik Åslund

TL;DR
This paper evaluates two algorithms for detecting and reducing deception in AI systems, demonstrating their effectiveness in simple game scenarios, with shielding generally yielding higher rewards.
Contribution
It introduces and empirically tests two novel algorithms for mitigating deception in AI, focusing on path-specific objectives and shielding techniques.
Findings
Both algorithms prevent deception in tested scenarios.
Shielding achieves higher reward than path-specific objectives.
Algorithms contribute to safer AI deployment.
Abstract
How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, where paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
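The shielding idea in the abstract (monitor a policy, and fall back to a safe reference policy if it is flagged) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `shield` and `misreports`, the dictionary representation of a policy, and the toy two-state signalling setup are all assumptions made for illustration, and the "message differs from state" check is only a stand-in for the paper's actual deception criterion.

```python
from typing import Callable, Dict

# Hypothetical representation: a policy maps each observed state to a message.
Policy = Dict[str, str]


def shield(candidate: Policy,
           reference: Policy,
           is_unsafe: Callable[[Policy], bool]) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe,
    in which case return the safe reference policy instead."""
    return reference if is_unsafe(candidate) else candidate


def misreports(policy: Policy) -> bool:
    """Toy monitor (an assumption, not the paper's deception test):
    flag a policy if it ever sends a message that differs from the true state."""
    return any(message != state for state, message in policy.items())


# Toy policies for a two-state signalling game (illustrative only).
honest: Policy = {"good": "good", "bad": "bad"}
lying: Policy = {"good": "good", "bad": "good"}

deployed_from_lying = shield(lying, honest, misreports)
deployed_from_honest = shield(honest, honest, misreports)
```

In this sketch the flagged policy `lying` is replaced by the reference, while a policy that passes the monitor is deployed unchanged. That second case is one intuition for why shielding can retain more reward than removing incentive paths outright.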
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Experimental Behavioral Economics Studies · Ethics and Social Impacts of AI
