Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions
Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales

TL;DR
This paper shows that large language models can accurately describe quantitative features of their own internal decision-making processes in certain kinds of decisions, and that training can improve this ability. The authors present this self-explanation as a path to understanding these systems that complements interpreting their neural networks directly.
Contribution
The paper introduces fine-tuning methods for self-interpretability, so that models explain their internal processes and decision weights more accurately and more generally.
Findings
LLMs can accurately report their internal decision weights.
Fine-tuning improves models' ability to explain their decision-making.
Training generalizes to various complex decisions.
Abstract
We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to explain their own functioning. Here, we show that i) LLMs can accurately describe quantitative features of their own internal processes during certain kinds of decision-making and ii) that it is possible to improve these capabilities through training. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes (e.g., the…
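The setup described in the abstract (decisions driven by randomly generated, quantitative attribute weights, followed by a check on how well a model reports those weights) can be illustrated with a minimal Python sketch. It is not the authors' code. The weight range, the attribute names, the weighted-sum choice rule, and the use of Pearson correlation as an accuracy score are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def random_weights(attributes, rng, low=-100, high=100):
    """Draw one integer weight per attribute (range is an assumption)."""
    return {name: rng.randint(low, high) for name in attributes}


def utility(option, weights):
    """Weighted sum of an option's attribute values."""
    return sum(weights[name] * option[name] for name in weights)


def choose(option_a, option_b, weights):
    """Return "A" or "B", whichever has the higher weighted sum (ties go to A)."""
    return "A" if utility(option_a, weights) >= utility(option_b, weights) else "B"


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def report_accuracy(target_weights, reported_weights):
    """Score a self-report by correlating it with the target weights."""
    names = sorted(target_weights)
    return pearson(
        [target_weights[n] for n in names],
        [reported_weights[n] for n in names],
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical condo attributes, each scored on an arbitrary 0-10 scale.
    attributes = ["size", "natural_light", "distance_to_work", "noise"]
    weights = random_weights(attributes, rng)
    condo_a = {name: rng.randint(0, 10) for name in attributes}
    condo_b = {name: rng.randint(0, 10) for name in attributes}
    print("weights:", weights)
    print("choice:", choose(condo_a, condo_b, weights))
    # A stand-in "self-report": the true weights plus noise.
    reported = {name: w + rng.randint(-20, 20) for name, w in weights.items()}
    print("report accuracy:", report_accuracy(weights, reported))
```

In this sketch, choices like these would serve as fine-tuning targets, and a model's stated weights would then be scored against the weights that generated them. How the paper actually measures the weights a model has learned, and how it scores the reports, may differ from this simplification.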
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
