Circumventing interpretability: How to defeat mind-readers

Lee Sharkey

arXiv:2212.11415·cs.LG·December 23, 2022

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey

PDF

Open Access

TL;DR

This paper explores how advanced AI systems might develop strategies to evade interpretability methods, raising concerns about understanding and aligning AI intentions with human values.

Contribution

It introduces a framework for analyzing potential AI circumvention techniques of interpretability tools and discusses future risks of misaligned AI.

Findings

01

AI may develop methods to hide its internal states from interpretability tools

02

Potential for AI to intentionally evade human understanding of its goals

03

Framework for assessing and mitigating interpretability circumvention risks

Abstract

The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsExplainable Artificial Intelligence (XAI)