Because we have LLMs, we Can and Should Pursue Agentic Interpretability
Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

TL;DR
This paper explores agentic interpretability enabled by LLMs, where models proactively assist human understanding through multi-turn conversations, fostering better mental models of AI systems.
Contribution
It introduces the concept of agentic interpretability, highlighting its potential, challenges, and the need for new evaluation methods in human-AI collaborative understanding.
Findings
Agentic interpretability leverages LLMs to teach and explain beyond traditional methods.
It enables humans to develop better mental models of AI systems.
Evaluation is a key challenge because having a human in the loop adds complexity.
Abstract
The era of Large Language Models (LLMs) presents a new opportunity for interpretability--agentic interpretability: a multi-turn conversation with an LLM wherein the LLM proactively assists human understanding by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM. Such conversation is a new capability that traditional 'inspective' interpretability methods (opening the black-box) do not use. Having a language model that aims to teach and explain--beyond just knowing how to talk--is similar to a teacher whose goal is to teach well, understanding that their success will be measured by the student's comprehension. While agentic interpretability may trade off completeness for interactivity, making it less suitable for high-stakes safety situations with potentially deceptive models, it leverages a cooperative model to…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · European and International Law Studies
