Experiments with Detecting and Mitigating AI Deception

Ismail Sahbane; Francis Rhys Ward; C Henrik Åslund

arXiv:2306.14816·cs.AI·June 27, 2023

TL;DR

This paper evaluates two algorithms for mitigating deception in AI systems on two simple games. Both keep the agent from being deceptive, and shielding tends to achieve higher reward.

Contribution

The paper analyses and empirically tests two algorithms for mitigating deception in AI: one based on the path-specific objectives framework and one based on shielding.

Findings

01 Both algorithms prevent deception in tested scenarios.

02 Shielding achieves higher reward than path-specific objectives.

03 Algorithms contribute to safer AI deployment.

Abstract

How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, where paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
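
The shielding idea in the abstract (monitor for unsafe policies and replace them with a safe reference policy) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Policy` type, the `shield` function, and the toy monitor `sends_false_signal` are hypothetical names introduced here, and the monitor is a simple stand-in for a real deception check in a signalling-style game where the agent observes its type and reports a signal.

```python
from typing import Callable, Dict

# Hypothetical representation: a policy maps an observed type to a reported signal.
Policy = Dict[str, str]


def shield(
    candidate: Policy,
    reference: Policy,
    is_unsafe: Callable[[Policy], bool],
) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe,
    in which case fall back to the safe reference policy."""
    return reference if is_unsafe(candidate) else candidate


def sends_false_signal(policy: Policy) -> bool:
    """Toy monitor (an assumption, not the paper's deception test):
    flag any policy that reports a signal different from the observed type."""
    return any(signal != observed for observed, signal in policy.items())


# Safe reference policy: always report the observed type truthfully.
honest: Policy = {"weak": "weak", "strong": "strong"}

# Candidate that misreports when weak; the monitor should flag it.
bluffing: Policy = {"weak": "strong", "strong": "strong"}

deployed = shield(bluffing, honest, sends_false_signal)
```

Here `deployed` ends up being the honest reference policy, because the candidate is flagged by the monitor. A candidate that passes the monitor would be deployed unchanged, which is one intuition for why shielding can retain more reward than removing incentive paths outright.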

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Experimental Behavioral Economics Studies · Ethics and Social Impacts of AI