Confidence Calibration in Vision-Language-Action Models

Thomas P Zollo; Richard Zemel

arXiv:2507.17383·cs.RO·December 23, 2025

Open Access·3 Reviews

TL;DR

This paper investigates confidence calibration in vision-language-action models for robotics, establishing baselines, analyzing calibration over time, and proposing lightweight methods to improve trustworthiness in robot decision-making.

Contribution

It introduces the first study on confidence calibration in VLAs, providing baseline metrics, analyzing calibration dynamics, and proposing prompt ensembles and action-wise Platt scaling for miscalibration correction.
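As a rough illustration of the prompt-ensemble idea named above, the sketch below averages a confidence score over several rephrasings of one instruction. The names `ensemble_confidence` and `confidence_fn` are hypothetical and introduced only for illustration; the paper's actual procedure may differ, for instance in how rephrasings are generated or how scores are combined.

```python
from statistics import fmean
from typing import Any, Callable, Sequence


def ensemble_confidence(
    confidence_fn: Callable[[Any, str], float],
    observation: Any,
    prompts: Sequence[str],
) -> float:
    """Average a model's confidence over several rephrased instructions.

    confidence_fn is a hypothetical callback that returns a confidence in
    [0, 1] for one (observation, instruction) pair. A simple mean is assumed
    here; other aggregation rules are possible.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    return fmean(confidence_fn(observation, p) for p in prompts)
```

Averaging is the simplest aggregation choice: it needs no extra training and leaves the policy itself untouched, at the cost of one forward pass per rephrasing.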

Findings

01

Calibration error correlates with task success

02

Calibration evolves over time during task execution

03

Prompt ensembles and Platt scaling improve calibration accuracy
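A minimal sketch of what action-wise Platt scaling could look like, under stated assumptions: each action dimension has its own raw confidence, one scalar pair (a, b) is fitted per dimension against binary task success, and the rescaled per-dimension confidences are averaged. The function names, the use of plain gradient descent, and the averaging step are illustrative choices, not details confirmed by this page.

```python
import numpy as np


def _logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log1p(-p)


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_platt(probs, labels, lr=0.01, steps=5000):
    """Fit scalars (a, b) so that sigmoid(a * logit(p) + b) tracks labels."""
    z = _logit(np.asarray(probs, dtype=float))
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0  # start from the identity mapping
    for _ in range(steps):
        err = _sigmoid(a * z + b) - y  # gradient of mean NLL w.r.t. the score
        a -= lr * np.mean(err * z)
        b -= lr * np.mean(err)
    return a, b


def fit_actionwise(probs, labels, **kwargs):
    """probs: (N, D) raw confidences, one column per action dimension.
    labels: (N,) binary outcomes. Returns one (a, b) pair per dimension."""
    probs = np.asarray(probs, dtype=float)
    return [fit_platt(probs[:, d], labels, **kwargs) for d in range(probs.shape[1])]


def apply_actionwise(probs, params):
    """Rescale each action dimension with its own (a, b), then average (assumed)."""
    probs = np.asarray(probs, dtype=float)
    cols = [_sigmoid(a * _logit(probs[:, d]) + b) for d, (a, b) in enumerate(params)]
    return np.mean(np.stack(cols, axis=1), axis=1)
```

Fitting the parameters on a held-out calibration set, rather than on the episodes used for evaluation, is the usual practice for post-hoc methods of this kind.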

Abstract

Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
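Calibration error of the kind discussed here is commonly summarized with expected calibration error (ECE), the Brier score, and negative log-likelihood (NLL). The sketch below computes all three for binary task success; the equal-width binning, the bin count, and the function names are assumptions chosen for illustration, not settings taken from the paper.

```python
import numpy as np


def expected_calibration_error(conf, success, n_bins=10):
    """Weighted average gap between mean confidence and success rate per bin."""
    conf = np.asarray(conf, dtype=float)
    success = np.asarray(success, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so indices run from 0 to n_bins - 1.
    idx = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(success[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


def brier_score(conf, success):
    """Mean squared difference between confidence and the binary outcome."""
    conf = np.asarray(conf, dtype=float)
    success = np.asarray(success, dtype=float)
    return float(np.mean((conf - success) ** 2))


def negative_log_likelihood(conf, success, eps=1e-12):
    """Mean binary cross-entropy of the outcomes under the stated confidences."""
    conf = np.clip(np.asarray(conf, dtype=float), eps, 1 - eps)
    success = np.asarray(success, dtype=float)
    return float(-np.mean(success * np.log(conf) + (1 - success) * np.log(1 - conf)))
```

A model that reports 0.8 confidence and succeeds on 80% of those episodes has an ECE near zero, while one that reports 0.9 and never succeeds has an ECE of 0.9.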

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

- The paper is well-motivated. The problem of confidence calibration is highly relevant for deploying trustworthy robotic systems in high-stakes, real-world environments.

Weaknesses

- The paper's technical novelty is limited. The proposed methods are largely simple applications of existing techniques to an existing VLA.

The claim of a systematic study is not well-supported by the experiments:

- All experiments are conducted on a single VLA, OpenVLA. This is not enough to claim a systematic study of calibration errors for VLAs. Diffusion-based VLAs might exhibit different calibration behavior.
- All experiments are conducted in the LIBERO simulation. For a study on this

Reviewer 02·Rating 6·Confidence 2

Strengths

The paper is clear, well-structured, and grounded in the calibration literature. The experimental design is sound, using multiple metrics (ECE, Brier, NLL) and analyzing calibration across tasks, time, and model precision. The results are consistent and interpretable. For example, better-performing models are also better calibrated, and early overconfidence decreases over time. The proposed fixes are simple but effective and practical for real-world systems.

Weaknesses

- Methods are empirical adaptations rather than theoretical advances.
- All experiments are simulation-only; no real-robot validation. More generally, it is not clear what role robotics plays in this framework.
- Only one model family (OpenVLA) and one benchmark were tested.
- Missing comparisons to other post-hoc calibration baselines.
- Discussion of broader implications (safety, planning) could be deeper.
- The evaluation is entirely simulation-based and limited to OpenVLA on LIBERO tasks, w

Reviewer 03·Rating 4·Confidence 3

Strengths

- Originality: From my understanding of the literature, there are no previous works that studied the problem of calibration in VLA models. The authors focus on this problem, which has not been previously studied.
- Quality: Overall, the paper is decently presented with high-quality results for calibration scores in the OpenVLA setting. Looking at the code, it seems to be cleaned up well and is transparently released for reproducibility.
- Clarity: Overall, the paper is clear, but needs improveme

Weaknesses

- The authors claim to have a comprehensive evaluation, but this paper focuses exclusively on OpenVLA. Over the past year or so, various other VLA models have been made open-source, so evaluating on other open-source VLA models (Octo, Pi-0, etc.) would significantly strengthen the claims in this work.
- As the authors acknowledge, all evaluation is performed in simulation, leaving out other sources of potential confidence miscalibration, including sensor noise, etc.

Taxonomy

Topics: Multimodal Machine Learning Applications