TL;DR
ConfProBench is a new benchmark that systematically evaluates the reliability of confidence scores produced by multimodal large language model process judges, highlighting their limitations and guiding future improvements.
Contribution
It introduces the first comprehensive benchmark with adversarial perturbations and novel metrics to assess the robustness, sensitivity, and calibration of MPJ confidence scores.
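To make the calibration aspect concrete, the sketch below computes a binned calibration error in the style of expected calibration error (ECE), a standard measure. It is an illustrative assumption, not the paper's own calibration metric, whose definition is not given on this page. The function name `expected_calibration_error` and its parameters are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Binned calibration error for step-correctness judgments.

    confidences[i]: the judge's stated probability that step i is correct.
    labels[i]: 1 if step i is actually correct, else 0.
    NOTE: a generic ECE-style measure, not the benchmark's own metric.
    """
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one step")
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")

    # Per-bin running totals: [sum of confidences, sum of labels, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, label in zip(confidences, labels):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += label
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, label_sum, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum / count
        frac_correct = label_sum / count
        # Weight each bin's confidence/correctness gap by its share of steps.
        ece += (count / total) * abs(mean_conf - frac_correct)
    return ece
```

A value near 0 means stated confidences track how often steps are actually correct; larger values indicate over- or under-confidence.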
Findings
Current MPJs show limitations in confidence reliability.
Benchmark provides a standardized way to evaluate confidence robustness.
Experiments with 14 models establish baseline performance and reveal weaknesses.
Abstract
Reasoning is a critical capability of multimodal large language models (MLLMs) for solving complex multimodal tasks, and judging the correctness of reasoning steps is crucial for improving this capability. Recently, MLLM-based process judges (MPJs) have been widely used to assess the correctness of reasoning steps in multimodal tasks. Therefore, evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs mainly focus on tasks such as step correctness classification and reasoning process search, while overlooking a key aspect: whether the confidence scores produced by MPJs at the step level are reliable. To address this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. Our benchmark constructs three…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The focus on step-level confidence is relatively new and could provide more information and supervision in building more advanced reasoning chains. 2. The proposed three metrics cover a whole spectrum of aspects, from robustness, to sensitivity and calibration. 3. Extensive experiments were conducted on both open-source and proprietary models to show how they perform on the proposed ConfProBench benchmark with the proposed metrics. 4. Writing is good and easy to follow.
1. The data construction process seems to rely heavily on MLLMs themselves, such as GPT-4o, to generate the different perturbations. While this automates data curation and makes it more scalable, it is also a bottleneck, capped by the capability of the generating model (i.e., GPT-4o). Should we get a new version of the dataset every time a more advanced model comes out? 2. The verbalized confidence used in the draft might be more prompt-dependent than more intrinsic methods (e.g., logit-based). 3. For Table 2
- The paper addresses an underexplored aspect of multimodal process judges (MPJs), i.e., their step-level confidence reliability. The proposed framework introduces three complementary metrics that assess robustness, sensitivity, and calibration. - The benchmark incorporates adversarial variants at the lexical, syntactic, and multimodal levels, which are constructed to preserve semantic meaning while effectively stressing the model’s confidence robustness. - The analysis provides valuable insi
- The lexical and syntactic perturbations are generated using GPT-4o, which may inadvertently advantage OpenAI models during evaluation. The paper should explicitly discuss this potential bias and clarify whether additional models or cross-validation methods were used to mitigate it. - The Data Quality Control section lacks essential information, such as the number and expertise of annotators, inter-annotator agreement scores, and rejection or revision rates. - The paper reports calibration met
1. The paper looks at an interesting aspect of the variance when using LLMs as judges. This directly affects how good a judge LLM can be. 2. The paper contains an extensive supplementary material, demonstrating that the authors have put work into the manuscript.
1. The definition of MPJ output as written in ln.119-120 is very ambiguous. In particular, I'm skeptical that asking an LLM to output a probability score has any meaning or consistency over different tries. Is there a standard implemented for the evaluation of correctness? Otherwise, I'm not convinced that LLMs, or even humans, are able to give consistent answers. 2. I feel the proposed metrics miss an important aspect: accuracy of the LLM's discrimination. It seems that the proposed metrics
1. The paper's contribution lies in successfully shifting the academic focus from the "classification performance" of MPJs to their "confidence reliability". In certain domains, a model must not only make correct judgments but also have an accurate self-awareness of its confidence. ConfProBench is the first work to provide a systematic evaluation framework for this problem, making it a pioneering work. 2. It proposes three novel evaluation metrics: CRS, CSS, and CCS. These metrics approach confi
1. The paper's core motivation is that the confidence reliability of MPJs is crucial for downstream tasks like reasoning chain optimization and automatic error correction. However, the entire evaluation is confined to intrinsic metrics. A more persuasive argument would involve extrinsic evaluation: demonstrating that an MPJ with higher scores on ConfProBench actually leads to better final performance when its confidence scores are used to guide a practical downstream task. 2. ConfProBench is entire
