Towards Reliable Medical LLMs: Benchmarking and Enhancing Confidence Estimation of Large Language Models in Medical Consultation

Zhiyao Ren; Yibing Zhan; Siyuan Liang; Guozheng Ma; Baosheng Yu; Dacheng Tao

arXiv:2601.15645·cs.CL·January 23, 2026

Open Access

TL;DR

This paper introduces a benchmark and a new framework, MedConf, to improve confidence estimation in large language models during multi-turn medical consultations, enhancing reliability and interpretability.

Contribution

The paper presents the first benchmark for multi-turn confidence assessment in medical LLMs and proposes MedConf, an evidence-grounded confidence estimation framework.

Findings

01. MedConf outperforms state-of-the-art methods on multiple metrics.

02. Medical data challenges existing confidence estimation methods.

03. Information sufficiency critically impacts confidence reliability.

Abstract

Large-scale language models (LLMs) often offer clinical judgments based on incomplete information, increasing the risk of misdiagnosis. Existing studies have primarily evaluated confidence in single-turn, static settings, overlooking the coupling between confidence and correctness as clinical evidence accumulates during real consultations, which limits their support for reliable decision-making. We propose the first benchmark for assessing confidence in multi-turn interaction during realistic medical consultations. Our benchmark unifies three types of medical data for open-ended diagnostic generation and introduces an information sufficiency gradient to characterize the confidence-correctness dynamics as evidence increases. We implement and compare 27 representative methods on this benchmark; two key insights emerge: (1) medical data amplifies the inherent limitations of token-level and…
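The idea of tracking how confidence and correctness relate as evidence accumulates can be illustrated with a small sketch. The visible part of the abstract does not say which metrics the benchmark reports, so the code below is only an assumption: it uses expected calibration error (ECE), a standard calibration measure, and computes it separately for each information sufficiency level. The record layout `(level, confidence, is_correct)` and both function names are hypothetical and are not taken from the paper or from MedConf.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Clamp the index so a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def ece_by_sufficiency(records, n_bins=10):
    """Group hypothetical (level, confidence, is_correct) records by level
    and return one ECE value per level, ordered by level."""
    grouped = defaultdict(lambda: ([], []))
    for level, conf, ok in records:
        grouped[level][0].append(conf)
        grouped[level][1].append(ok)
    return {
        level: expected_calibration_error(confs, oks, n_bins)
        for level, (confs, oks) in sorted(grouped.items())
    }


if __name__ == "__main__":
    # Toy, made-up records: a fraction of available evidence, the stated
    # confidence, and whether the diagnosis was correct.
    toy_records = [
        (0.25, 0.85, False),
        (0.25, 0.85, True),
        (1.0, 0.85, True),
        (1.0, 0.85, True),
    ]
    for level, ece in ece_by_sufficiency(toy_records).items():
        print(f"sufficiency={level:.2f}  ECE={ece:.3f}")
```

In this toy data the stated confidence stays fixed while accuracy improves with more evidence, so the calibration error shrinks at the higher sufficiency level. A well-behaved estimator would instead report lower confidence when evidence is sparse.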

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling