Propositional Interpretability in Artificial Intelligence
David J. Chalmers

TL;DR
This paper argues for propositional interpretability in AI: explaining an AI system's mechanisms and behavior in terms of propositional attitudes such as belief, desire, and subjective probability. It sets out challenges such as thought logging and evaluates current interpretability methods.
Contribution
It introduces propositional interpretability as a framework for explaining AI systems and assesses how well existing interpretability methods and philosophical approaches serve that goal.
Findings
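Propositional attitudes are the central way we interpret and explain human beings, and they are likely to be central in interpreting AI systems too.
Thought logging, meaning the logging of all relevant propositional attitudes in an AI system over time, is a central challenge for propositional interpretability.
Currently popular interpretability methods (such as probing) each have strengths and weaknesses as routes to propositional interpretability.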
Abstract
Mechanistic interpretability is the program of explaining what AI systems are doing in terms of their internal mechanisms. I analyze some aspects of the program, along with setting out some concrete challenges and assessing progress to date. I argue for the importance of propositional interpretability, which involves interpreting a system's mechanisms and behavior in terms of propositional attitudes: attitudes (such as belief, desire, or subjective probability) to propositions (e.g. the proposition that it is hot outside). Propositional attitudes are the central way that we interpret and explain human beings and they are likely to be central in AI too. A central challenge is what I call thought logging: creating systems that log all of the relevant propositional attitudes in an AI system over time. I examine currently popular methods of interpretability (such as probing, sparse…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI)
