LAMP: Extracting Local Decision Surfaces From Large Language Models
Ryan Chen, Youngmin Ko, Zeyu Zhang, Catherine Cho, Sunny Chung, Mauro Giuffré, Dennis L. Shung, Bradly C. Stadie

TL;DR
LAMP is a lightweight method that approximates local decision surfaces of language models using their self-reported explanations, aiding in model interpretability and auditing without needing internal access.
Contribution
It introduces LAMP, a novel approach that maps model explanations to decision surfaces, enabling practical auditing of proprietary models without internal gradient or activation data.
Findings
LAMP's local linear approximations align with human judgments of explanation quality.
LAMP effectively audits models on sentiment, controversy, and safety tasks.
Its assessments are consistent with expert evaluations on clinical data.
Abstract
We introduce LAMP (Local Attribution Mapping Probe), a method that shines light onto a black-box language model's decision surface and studies how reliably a model maps its stated reasons to its reported predictions by approximating a decision surface. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links those weights to the model's output. By doing so, it reveals how much the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, LAMP reveals that many language models' locally approximated linear decision landscapes overall agree with human judgments on explanation quality and, on a clinical case-file data set, align with expert assessments. Since LAMP operates without requiring access to model…
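To make the idea of a "locally linear surrogate" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each query to the model yields a vector of self-reported factor weights and a scalar prediction, and it fits an ordinary least-squares plane through those points. The function name `fit_local_surrogate`, the choice of least squares, the R² fit score, and the synthetic data that stands in for real model queries are all assumptions made for illustration; the abstract does not specify the fitting procedure.

```python
import numpy as np


def fit_local_surrogate(weights, outputs):
    """Fit outputs ~ weights @ coef + intercept by ordinary least squares.

    weights: (n_samples, n_factors) self-reported explanation weights (assumed format).
    outputs: (n_samples,) scalar predictions reported by the model (assumed format).
    Returns the coefficients, the intercept, and the R^2 of the fit.
    """
    X = np.asarray(weights, dtype=float)
    y = np.asarray(outputs, dtype=float)
    # Append a column of ones so the intercept is estimated jointly.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    params, *_ = np.linalg.lstsq(A, y, rcond=None)
    coef, intercept = params[:-1], params[-1]
    # R^2 measures how much of the output variation the stated factors explain.
    residual = y - A @ params
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return coef, intercept, r2


# Synthetic stand-in for model queries: small perturbations around one input.
rng = np.random.default_rng(0)
base = np.array([0.6, 0.3, 0.1])            # hypothetical factor weights
W = base + rng.normal(scale=0.05, size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])      # hypothetical local slope
y = W @ true_coef + 0.2 + rng.normal(scale=0.01, size=50)

coef, intercept, r2 = fit_local_surrogate(W, y)
print("coefficients:", np.round(coef, 2))
print("R^2:", round(r2, 3))
```

Under these assumptions, a high R² would indicate that the stated factors account for most of the local variation in the model's output, while a low R² would suggest that the explanation weights do not steer the prediction in a locally linear way.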
