Probabilistic Medical Predictions of Large Language Models
Bowen Gu, Rishi J. Desai, Kueiyu Joshua Lin, Jie Yang

TL;DR
This paper evaluates the reliability of probability estimates from large language models in medical predictions, finding that implicit probabilities outperform explicit ones, especially in smaller models and imbalanced datasets.
Contribution
It provides a comparative analysis of explicit versus implicit probability estimates in LLMs for clinical predictions, highlighting their limitations and areas for improvement.
Findings
Implicit probabilities outperform explicit probabilities in discrimination, precision, and recall.
Smaller LLMs and imbalanced datasets exacerbate probability estimation issues.
Explicit prompts often lead to unreliable probability estimates.
Abstract
Large Language Models (LLMs) have shown promise in clinical applications through prompt engineering, allowing flexible clinical predictions. However, they struggle to produce reliable prediction probabilities, which are crucial for transparency and decision-making. While explicit prompts can lead LLMs to generate probability estimates, their numerical reasoning limitations raise concerns about reliability. We compared explicit probabilities from text generation to implicit probabilities derived from the likelihood of predicting the correct label token. Across six advanced open-source LLMs and five medical datasets, explicit probabilities consistently underperformed implicit probabilities in discrimination, precision, and recall. This discrepancy is more pronounced with smaller LLMs and imbalanced datasets, highlighting the need for cautious interpretation, improved probability…
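The distinction between the two kinds of probability can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary task with hypothetical label tokens "yes" and "no", assumes the model's raw logits for those tokens are already available, and assumes the implicit probability is obtained by a softmax restricted to the label tokens; the paper may normalize differently. The function names and the parsing rule for explicit answers are likewise illustrative.

```python
import math
import re


def implicit_probability(label_logits, positive_label="yes"):
    """Softmax over the candidate label tokens' logits; returns P(positive_label).

    Assumption: label_logits maps each label token (e.g. "yes", "no") to the
    model's logit for that token at the answer position.
    """
    max_logit = max(label_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - max_logit) for tok, v in label_logits.items()}
    return exps[positive_label] / sum(exps.values())


def explicit_probability(generated_text):
    """Parse the first number in the model's generated answer as a probability.

    Returns None when no number is found or the value falls outside [0, 1],
    which is one way an explicit estimate can turn out to be unusable.
    """
    match = re.search(r"\d*\.?\d+", generated_text)
    if match is None:
        return None
    value = float(match.group())
    # Treat "85%" as 0.85 (an assumed convention for this sketch).
    if generated_text[match.end():match.end() + 1] == "%":
        value /= 100.0
    return value if 0.0 <= value <= 1.0 else None


if __name__ == "__main__":
    # Hypothetical logits, not taken from the paper.
    print(implicit_probability({"yes": 2.0, "no": 0.0}))
    print(explicit_probability("The probability is 0.85."))
    print(explicit_probability("I am not sure."))
```

In this sketch the implicit estimate is always a valid probability because it is computed from the token distribution, whereas the explicit estimate depends on the model writing a well-formed number, which is where numerical reasoning limitations can surface.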
Taxonomy
Topics: Machine Learning in Healthcare · Topic Modeling
