Interpretable Probability Estimation with LLMs via Shapley Reconstruction
Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu

TL;DR
This paper introduces PRISM, a framework that enhances the transparency and accuracy of probability estimates from Large Language Models by decomposing predictions into input factors using Shapley values, applicable across various domains.
Contribution
PRISM is a novel method that reconstructs LLM probability estimates with improved calibration and interpretability through Shapley value decomposition.
Findings
PRISM outperforms direct prompting in predictive accuracy.
PRISM provides transparent insights into factor contributions.
PRISM is applicable across the finance, healthcare, and agriculture domains.
Abstract
Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying prediction process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM's prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we…
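The decomposition and reconstruction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_logit` is a hypothetical stand-in for a logit-scale score elicited from an LLM when only a subset of factors is revealed, the factor names and numbers are invented, and the base logit $\phi_0$ is assumed here to be the score of the empty factor set. The sketch computes exact Shapley values and checks that $\sigma(\phi_0 + \sum \phi_i)$ recovers the full-information estimate.

```python
from itertools import combinations
from math import comb, exp


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + exp(-z))


def shapley_values(factors, value):
    """Exact Shapley values of `value` (a function of a frozenset of factors)."""
    n = len(factors)
    phi = {}
    for f in factors:
        others = [g for g in factors if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight |S|!(n-|S|-1)!/n! for a coalition S of size k
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Hypothetical factors and scores; a real pipeline would query an LLM here.
FACTORS = ["age", "smoker", "exercise"]


def toy_logit(subset):
    main_effects = {"age": 0.8, "smoker": 1.2, "exercise": -0.5}
    score = -1.0 + sum(main_effects[f] for f in subset)
    if "age" in subset and "smoker" in subset:
        score += 0.4  # invented interaction term
    return score


phi0 = toy_logit(frozenset())              # assumed base logit
phi = shapley_values(FACTORS, toy_logit)   # per-factor contributions
reconstructed_logit = phi0 + sum(phi.values())
p_prism = sigmoid(reconstructed_logit)
```

In this toy setup the interaction term is split equally between `age` and `smoker`, and the contributions sum exactly to the gap between the full-information logit and the base logit, which is the additivity property the reviews refer to.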
Peer Reviews
Decision · Submitted to ICLR 2026
PRISM offers a transparent, end‑to‑end recipe: factor impacts are estimated explicitly, sum to the final logit, and explain the probability. On standard tabular tasks, PRISM is often competitive or superior to strong prompt baselines across two LLMs. Tabular‑PRISM is a practical engineering improvement that reduces query count while preserving interpretability. Visualizations of factor interactions provide diagnostics that typical prompting pipelines lack.
Scope is limited to zero‑shot binary classification; multi‑class outcomes are not evaluated, and the unstructured text studies are small. The method’s query cost scales with the number of factors and sampled contexts; even with batching, total evaluations may be non‑trivial for large‑scale deployment. The final probability depends on a chosen base logit $\phi_0$; while ranking metrics are unaffected, calibration and thresholded decisions may be sensitive, and a deeper analysis is warranted. The
- This paper proposes a novel idea: transforming "probability estimation" into "reconstruction of Shapley marginal contributions in logit space". This creatively grafts the classical additivity of Shapley values onto LLM "verbal" probabilities. It also introduces Tabular-PRISM with batched paired comparisons and reference-sample imputation, together with a reference-specific Shapley definition and a formal proposition guaranteeing additive reconstruction, advances at both the definitional and algorit
- The central premise of the paper is that a probability reconstructed from Shapley values, $P_{PRISM} = \sigma(\phi_0 + \sum \phi_i)$, is more accurate and less "noisy" than an LLM's direct, holistic probability estimate. This is a significant claim that lacks a strong theoretical foundation.
- The Tabular-PRISM variant, which is used for all the main benchmark experiments in Table 1, is critically dependent on the choice of a single "reference instance" $r$. The paper defines the calculat
- PRISM’s main win is that the final probability is explicitly explained as base logit + per-factor Shapley contributions.
- The method is motivated by the observation that LLMs are more reliable at pairwise comparisons than at absolute probability statements, so PRISM asks the model to do the former and only then reconstructs the latter. That is a sensible, model-aware design choice.
- PRISM is run on both GPT-4.1-mini and Gemini-2.5-Pro with essentially the same recipe, and on tabular tasks f
- All experiments are binary and zero-shot. Multi-class is only discussed as a possible one-vs-all extension and few-shot is explicitly deferred because demonstrations confound attribution. So we don’t yet know whether PRISM still produces clean factor attributions once the prompt contains examples or multiple labels.
- The main quantitative evidence is four tabular-ish binary tasks plus two small real-world case studies (apple prices, football matches). That’s not enough to claim broad generali
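The reviewers' remark about the base logit $\phi_0$ (ranking metrics are unaffected, but calibration and thresholded decisions may be sensitive) can be seen in a small sketch. The instances and contribution values below are invented for illustration, and the sketch assumes a single $\phi_0$ shared by all instances; it is not drawn from the paper's experiments.

```python
from math import exp


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + exp(-z))


def reconstruct(phi0, contributions):
    """Probability from a base logit plus per-factor contributions."""
    return sigmoid(phi0 + sum(contributions))


# Hypothetical per-factor contributions for three instances.
INSTANCES = {"a": [0.9, -0.2], "b": [0.1, 0.3], "c": [-0.6, 0.1]}


def ranking(phi0):
    """Instances ordered from highest to lowest reconstructed probability."""
    probs = {k: reconstruct(phi0, v) for k, v in INSTANCES.items()}
    return sorted(probs, key=probs.get, reverse=True)


def decisions(phi0, threshold=0.5):
    """Thresholded positive/negative decision per instance."""
    return {k: reconstruct(phi0, v) >= threshold for k, v in INSTANCES.items()}


same_order = ranking(0.0) == ranking(-0.6)        # order is preserved
flipped = decisions(0.0) != decisions(-0.6)       # some decisions change
```

Because the sigmoid is monotone, adding the same constant to every logit cannot reorder instances, so ranking-based metrics stay fixed while the set of instances above a fixed threshold can change.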
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
