Circumventing interpretability: How to defeat mind-readers
Lee Sharkey

TL;DR
This paper explores how advanced AI systems might develop strategies to evade interpretability methods, raising concerns about our ability to understand AI intentions and align them with human values.
Contribution
It introduces a framework for analyzing the techniques an AI might use to circumvent interpretability tools and discusses the future risks posed by misaligned AI.
Findings
A capable AI may develop methods to hide its internal states from interpretability tools
A misaligned AI may intentionally evade human understanding of its goals
A framework can help assess and mitigate the risks of interpretability circumvention
Abstract
The increasing capabilities of artificial intelligence (AI) systems make it ever more important that we interpret their internals to ensure that their intentions are aligned with human values. Yet there is reason to believe that misaligned artificial intelligence will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these potential future risks.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
