Surrogate modeling for interpreting black-box LLMs in medical predictions
Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

TL;DR
This paper introduces a surrogate modeling framework to interpret and analyze the knowledge encoded in large language models, especially in medical prediction tasks, revealing biases and inaccuracies.
Contribution
The authors develop a quantitative surrogate modeling approach that explains LLM-encoded knowledge and uncovers biases and inaccuracies in medical predictions.
Findings
Revealed LLM-encoded associations that contradict established medical knowledge.
Detected persistent racial biases in LLM-encoded knowledge.
Demonstrated that the framework can explain how LLMs "perceive" each input variable in relation to the output.
Abstract
Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output.…
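The workflow described in the abstract (simulate scenarios, prompt the model, fit a simple model to the input-output pairs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `mock_llm_risk` is a hypothetical stand-in for a real LLM call, the variables `age`, `smoker`, and `group_b` and their weights are invented for the example, and ordinary least squares is assumed as the surrogate because the excerpt does not say which surrogate model the paper uses.

```python
import itertools

def mock_llm_risk(age, smoker, group_b):
    """Hypothetical stand-in for prompting an LLM and parsing a risk score.
    The weights are invented for illustration; they are not results from the paper."""
    return 0.01 * age + 0.2 * smoker + 0.05 * group_b

def simulate_scenarios():
    """Enumerate a full grid of simulated patient profiles (assumed variables)."""
    ages = range(30, 81, 10)
    return list(itertools.product(ages, (0, 1), (0, 1)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_linear_surrogate(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0, *map(float, x)] for x in X]  # prepend an intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

# 1. Simulate scenarios, 2. collect the black box's outputs, 3. fit the surrogate.
scenarios = simulate_scenarios()
outputs = [mock_llm_risk(*s) for s in scenarios]
coefs = fit_linear_surrogate(scenarios, outputs)
surrogate = dict(zip(["intercept", "age", "smoker", "group_b"], coefs))

for name, value in surrogate.items():
    print(f"{name}: {value:.4f}")
```

In this toy setup the fitted coefficients estimate how strongly the black box weights each input. A clearly nonzero coefficient on a demographic indicator such as `group_b`, where domain knowledge expects no effect, is the kind of signal that would be flagged as a potential bias.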
