Explaining black box text modules in natural language with language models
Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, Jianfeng Gao

TL;DR
This paper introduces SASC, a method for automatically generating natural language explanations for black box text modules, including neural network components and brain regions, enhancing interpretability across AI and neuroscience.
Contribution
The paper presents SASC, a novel approach that provides natural language explanations and reliability scores for black box text modules, applicable to models and brain data.
Findings
SASC often recovers ground truth explanations in synthetic modules.
It enables inspection of internal modules within BERT.
It can generate explanations for individual fMRI voxels' responses.
Abstract
Large language models (LLMs) have demonstrated remarkable prediction performance for a growing array of tasks. However, their rapid proliferation and increasing opaqueness have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs/outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in 3 contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations.…
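To make the abstract's definitions concrete, here is a minimal Python sketch of a "text module" as a black box function from text to a scalar, probed only through its inputs and outputs. It is an illustration under stated assumptions, not the paper's implementation: `toy_module`, `top_ngrams`, and `response_gap` are hypothetical names, the keyword-based module stands in for an LLM submodule or a fitted brain-region model, and the step that turns the top-ranked n-grams into a natural language explanation (done with a language model) is left out.

```python
from typing import Callable, Iterable, List, Tuple

# A "text module" in the abstract's sense: any function from text to a scalar.
TextModule = Callable[[str], float]


def toy_module(text: str) -> float:
    """Hypothetical black box: fraction of tokens that are food words."""
    food_words = {"pizza", "bread", "soup", "apple"}
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in food_words for t in tokens) / len(tokens)


def top_ngrams(
    module: TextModule, corpus: Iterable[str], n: int = 2, k: int = 3
) -> List[Tuple[str, float]]:
    """Rank corpus n-grams by module response, using only input/output access."""
    ngrams = set()
    for doc in corpus:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i : i + n]))
    # Sort by descending response; break ties alphabetically for determinism.
    ranked = sorted(((g, module(g)) for g in ngrams), key=lambda p: (-p[1], p[0]))
    return ranked[:k]


def response_gap(
    module: TextModule, related: List[str], unrelated: List[str]
) -> float:
    """Mean response to texts matching a candidate explanation minus the
    mean response to unrelated texts (an assumed, simplified score)."""
    mean_related = sum(module(t) for t in related) / len(related)
    mean_unrelated = sum(module(t) for t in unrelated) / len(unrelated)
    return mean_related - mean_unrelated
```

Calling `top_ngrams(toy_module, corpus)` on a small corpus surfaces the n-grams that drive the module most strongly, and `response_gap` gives a rough check of whether a candidate explanation such as "food" separates related from unrelated text. A larger gap suggests the explanation captures what the module responds to.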
Taxonomy
Topics: Machine Learning in Materials Science · Explainable Artificial Intelligence (XAI) · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Adam · Linear Warmup With Linear Decay · Softmax · Layer Normalization · Dropout
