Surrogate modeling for interpreting black-box LLMs in medical predictions

Changho Han; Songsoo Kim; Dong Won Kim; Leo Anthony Celi; Jaewoong Kim; SungA Bae; Dukyong Yoon

arXiv:2604.20331 · cs.CL · April 24, 2026

TL;DR

This paper introduces a surrogate modeling framework to interpret and analyze the knowledge encoded in large language models, especially in medical prediction tasks, revealing biases and inaccuracies.

Contribution

The authors develop a quantitative surrogate modeling approach that explains LLM-encoded knowledge and uncovers biases and inaccuracies in medical predictions.

Findings

01

Revealed LLM-encoded associations that contradict established medical knowledge.

02

Detected persistent racial biases in LLM-encoded knowledge.

03

Demonstrated the framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output.

Abstract

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output.…
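
The workflow described in the abstract (prompt across simulated scenarios, collect input-output pairs, fit a simplified model) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the variable names, the `mock_llm_risk` stand-in (which replaces a real LLM call with a fixed formula), and the choice of a main-effects surrogate are not taken from the paper, whose actual surrogate model is not specified in the abstract.

```python
from itertools import product
from statistics import mean

# Hypothetical binary input variables for a simulated patient scenario.
VARIABLES = ["age_over_65", "smoker", "hypertension"]


def mock_llm_risk(scenario):
    """Stand-in for prompting a black-box LLM and parsing a numeric risk.

    A real pipeline would render the scenario as a prompt and call a model;
    this fixed formula exists only so the sketch runs without one.
    """
    score = 0.10
    score += 0.30 * scenario["age_over_65"]
    score += 0.20 * scenario["smoker"]
    score += 0.05 * scenario["hypertension"]
    return score


def simulate_scenarios(variables):
    """Enumerate every combination of the binary variables."""
    return [
        dict(zip(variables, values))
        for values in product([0, 1], repeat=len(variables))
    ]


def fit_main_effects(scenarios, outputs, variables):
    """Fit a main-effects surrogate on the observed input-output pairs.

    Because the design is a balanced full factorial, each variable's effect
    is the mean output with the variable present minus the mean output with
    it absent.
    """
    effects = {}
    for name in variables:
        present = [y for s, y in zip(scenarios, outputs) if s[name] == 1]
        absent = [y for s, y in zip(scenarios, outputs) if s[name] == 0]
        effects[name] = mean(present) - mean(absent)
    return effects


scenarios = simulate_scenarios(VARIABLES)
outputs = [mock_llm_risk(s) for s in scenarios]
effects = fit_main_effects(scenarios, outputs, VARIABLES)

# Rank variables by how strongly the black box responds to them.
for name, effect in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {effect:+.3f}")
```

In this toy setting the surrogate's effects can be compared against domain expectations: a variable whose estimated effect has the wrong sign or an implausible magnitude would be a candidate for the kind of contradiction or bias the paper reports.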
