Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

John Ray B. Martinez

arXiv:2603.24481 · cs.AI · March 26, 2026

TL;DR

This paper introduces a multi-agent framework with consistency verification for medical multiple-choice question answering (MCQA). It improves uncertainty calibration by 49-74% across datasets, which makes the reported confidence more useful as a deferral signal in clinical settings.

Contribution

The paper presents a multi-agent approach that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical question answering.

Findings

01  Calibration improved by 49-74% across datasets

02  Full system achieves ECE = 0.091 and AUROC = 0.630

03  Two-Phase Verification is key to calibration gains
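
The findings above report ECE (expected calibration error). For readers unfamiliar with the metric, the following is a minimal Python sketch of the common equal-width-bin estimator. The function name, the 10-bin default, and the toy inputs are illustrative assumptions; this summary does not state which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: floats in [0, 1], one per question.
    correct:     booleans, True where the predicted answer was right.
    n_bins:      number of equal-width bins (assumed default, not from the paper).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy example: a model that always reports 0.9 but is right 3 times out of 4.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

In the toy example every prediction falls in one bin, so the result is simply the gap between the mean confidence (0.9) and the accuracy (0.75). A model that is always overconfident, as the abstract describes, shows up as a large gap of this kind.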

Abstract

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and…
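
The abstract says the S-scores drive a weighted fusion that selects the final answer and calibrates the reported confidence, but it does not give the formula. The sketch below shows one plausible reading only: each specialist's vote is weighted by its S-score, the option with the largest total weight wins, and the reported confidence is that option's share of the total weight. The function name, the assumption that S-scores lie in [0, 1], and the normalization are all hypothetical and should not be read as the paper's actual method.

```python
from collections import defaultdict


def s_score_weighted_fusion(votes):
    """Fuse specialist answers by S-score (hypothetical scheme, not from the paper).

    votes: list of (answer_option, s_score) pairs, one per specialist agent.
    Returns (chosen_option, confidence) with confidence in [0, 1].
    """
    if not votes:
        raise ValueError("need at least one specialist vote")

    # Sum the S-scores behind each answer option.
    totals = defaultdict(float)
    for option, s_score in votes:
        if not 0.0 <= s_score <= 1.0:
            raise ValueError("S-scores are assumed to lie in [0, 1]")
        totals[option] += s_score

    best = max(totals, key=totals.get)
    total_weight = sum(totals.values())
    if total_weight == 0.0:
        # No agent has any self-verified support; fall back to a uniform guess.
        return best, 1.0 / len(totals)
    return best, totals[best] / total_weight


# Toy example with four specialists (respiratory, cardiology, neurology, gastroenterology).
answer, confidence = s_score_weighted_fusion(
    [("B", 0.9), ("B", 0.6), ("C", 0.5), ("A", 0.0)]
)
```

Here two specialists back option B with a combined weight of 1.5 out of 2.0, so B is selected with a reported confidence of 0.75. Under this reading, a specialist whose answer fails self-verification (S-score near 0) contributes almost nothing to either the choice or the confidence.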

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Clinical Reasoning and Diagnostic Skills