Linking Perception, Confidence and Accuracy in MLLMs
Yuetian Du, Yucheng Wang, Rongyu Zhang, Zhijie Xu, Boyu Yang, Ming Kong, Jie Liu, Qiang Zhu

TL;DR
This paper identifies confidence miscalibration in Multi-modal Large Language Models and introduces a novel framework with confidence-based training and test-time scaling to improve perceptual sensitivity and overall performance.
Contribution
It proposes Confidence-Driven Reinforcement Learning and Confidence-Aware Test-Time Scaling to calibrate confidence and enhance MLLMs, achieving state-of-the-art results across multiple benchmarks.
Findings
A probing experiment uncovers severe confidence miscalibration in MLLMs.
Proposed methods improve calibration and accuracy by 8.8%.
Framework achieves consistent gains across four benchmarks.
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have predominantly focused on enhancing visual perception to improve accuracy. However, a critical question remains unexplored: Do models know when they do not know? Through a probing experiment, we reveal a severe confidence miscalibration problem in MLLMs. To address this, we propose Confidence-Driven Reinforcement Learning (CDRL), which uses original-noise image pairs and a novel confidence-based reward to enhance perceptual sensitivity and robustly calibrate the model's confidence. Beyond training benefits, calibrated confidence enables more effective test-time scaling as a free lunch. We further propose Confidence-Aware Test-Time Scaling (CA-TTS), which dynamically coordinates Self-Consistency, Self-Reflection, and Visual Self-Check modules guided by confidence signals. An Expert Model acts in multiple roles (e.g.,…
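To make "confidence miscalibration" concrete, the sketch below computes Expected Calibration Error (ECE), a standard calibration metric. It is an illustration only: the abstract does not say which metric the probing experiment uses, and the function name, the equal-width binning, and the default of 10 bins are assumptions rather than details taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Return ECE: the sample-weighted mean gap between confidence and accuracy.

    Hypothetical helper for illustration; not the paper's evaluation code.
    `confidences` are model-reported probabilities in [0, 1];
    `correct` marks whether each corresponding answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident model: always reports 0.9 but is right only half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
```

A model that "knows when it does not know" would report lower confidence on the answers it gets wrong, which shrinks the gap in each bin and drives the score toward zero.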
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
