Instinct vs. Reflection: Unifying Token and Verbalized Confidence in Multimodal Large Models

Yunkai Dang; Yifan Jiang; Yizhu Jiang; Anqi Chen; Wenbin Li; Yang Gao

arXiv:2604.17274·cs.CV·April 21, 2026

TL;DR

This paper introduces a confidence estimation framework for multimodal large language models (MLLMs) that addresses the mismatch between a model's token-level support and its verbalized self-assessment, with the aim of making its confidence more reliable.

Contribution

The paper proposes a monotone confidence fusion method and an order-preserving mean alignment step to improve confidence calibration and failure prediction in MLLMs.

Findings

01 Improved confidence calibration across diverse MLLMs.

02 Enhanced failure prediction accuracy.

03 Consistent reliability gains demonstrated in experiments.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in various perception and reasoning tasks. Despite this success, ensuring their reliability in practical deployment necessitates robust confidence estimation. Prior work has predominantly focused on text-only LLMs, often relying on computationally expensive self-consistency sampling. In this paper, we extend confidence estimation to multimodal settings and conduct a comprehensive evaluation of MLLMs' response confidence estimation. Our analysis reveals a significant instinct-reflection misalignment: the model's implicit token-level support frequently diverges from its verbalized self-assessed confidence. To address this misalignment, we propose a monotone confidence fusion framework that merges dual-channel signals and cross-channel consistency to estimate correctness. Subsequently, an order-preserving mean alignment step…
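
The two channels and the two steps named in the abstract can be illustrated with a minimal Python sketch. Nothing below is taken from the paper or its repository: the convex-combination fusion, the power-transform alignment, the separate consistency score, and every function name (`token_confidence`, `fuse`, `consistency`, `align_mean`) are assumptions chosen only to show what "monotone" and "order-preserving" could mean in practice.

```python
import math


def token_confidence(token_logprobs):
    """Hypothetical 'instinct' channel: exp of the mean token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def fuse(p_token, p_verbal, w_token=0.5, w_verbal=0.5):
    """Assumed fusion rule: a convex combination of the two channels.

    Non-negative weights make the result non-decreasing in each channel,
    which is one simple way to obtain a monotone fusion.
    """
    if w_token < 0 or w_verbal < 0 or w_token + w_verbal == 0:
        raise ValueError("weights must be non-negative and not both zero")
    return (w_token * p_token + w_verbal * p_verbal) / (w_token + w_verbal)


def consistency(p_token, p_verbal):
    """Assumed cross-channel consistency: 1 minus the absolute gap."""
    return 1.0 - abs(p_token - p_verbal)


def align_mean(scores, target_mean, iters=60):
    """Assumed alignment: find gamma > 0 so that mean(s ** gamma) is close
    to target_mean. The map s -> s ** gamma is increasing on [0, 1], so the
    ranking of the scores is preserved.
    """
    lo, hi = 1e-3, 1e3
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a log scale
        mean = sum(s ** mid for s in scores) / len(scores)
        if mean > target_mean:
            lo = mid  # a larger exponent lowers the mean
        else:
            hi = mid
    gamma = math.sqrt(lo * hi)
    return [s ** gamma for s in scores]


if __name__ == "__main__":
    # Made-up (token, verbalized) confidence pairs for three answers.
    pairs = [(0.9, 0.6), (0.4, 0.8), (0.2, 0.3)]
    fused = [fuse(t, v) for t, v in pairs]
    gaps = [consistency(t, v) for t, v in pairs]
    # target_mean would typically be accuracy on a held-out set (assumed 0.4).
    aligned = align_mean(fused, target_mean=0.4)
    print(fused, gaps, aligned)
```

In this sketch the consistency score is reported separately instead of being folded into the fused value, which keeps the monotonicity argument trivial; the paper's framework may combine these signals differently.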

Code & Models

Repositories

Yunkaidang/Instinct-vs.-Reflection (GitHub)
