Confidence Calibration in Vision-Language-Action Models
Thomas P Zollo, Richard Zemel

TL;DR
This paper investigates confidence calibration in vision-language-action models for robotics, establishing baselines, analyzing calibration over time, and proposing lightweight methods to improve trustworthiness in robot decision-making.
Contribution
It introduces the first study on confidence calibration in VLAs, providing baseline metrics, analyzing calibration dynamics, and proposing prompt ensembles and action-wise Platt scaling for miscalibration correction.
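As a rough illustration of the prompt-ensemble idea, the sketch below averages the action distributions produced under several paraphrases of the same instruction and reads the confidence off the averaged distribution. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predict`, the three paraphrases, and the three-bin action space are hypothetical stand-ins for a real VLA forward pass.

```python
def average_distributions(distributions):
    """Element-wise mean of several probability distributions of equal length."""
    n = len(distributions)
    size = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(size)]


def ensemble_prediction(predict_fn, observation, instructions):
    """Query the model once per paraphrase and average the resulting distributions.

    Returns (chosen action bin, ensemble confidence in that bin).
    """
    dists = [predict_fn(observation, text) for text in instructions]
    mean = average_distributions(dists)
    best = max(range(len(mean)), key=mean.__getitem__)
    return best, mean[best]


def toy_predict(observation, instruction):
    """Hypothetical stand-in for a VLA: maps an instruction to a distribution
    over three discrete action bins. The observation is ignored here."""
    table = {
        "pick up the mug": [0.7, 0.2, 0.1],
        "grasp the mug": [0.5, 0.4, 0.1],
        "lift the mug": [0.6, 0.1, 0.3],
    }
    return table[instruction]


paraphrases = ["pick up the mug", "grasp the mug", "lift the mug"]
action_bin, confidence = ensemble_prediction(toy_predict, None, paraphrases)
```

Averaging probabilities rather than picking the most confident single prompt tends to temper confidence that depends on one particular wording; whether the paper aggregates in exactly this way is not stated in this summary.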
Findings
Calibration error correlates with task success
Calibration evolves over time during task execution
Prompt ensembles and Platt scaling reduce calibration error
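For readers unfamiliar with how calibration error is quantified, the following sketch computes three standard metrics (expected calibration error, Brier score, and negative log-likelihood) from confidence scores and binary outcomes. It is a generic illustration; the bin count and the equal-width binning scheme are assumptions, and the paper's exact evaluation protocol may differ.

```python
import math


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, outcomes):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def negative_log_likelihood(confidences, outcomes, eps=1e-12):
    """Mean binary cross-entropy; confidences are clamped to avoid log(0)."""
    total = 0.0
    for c, y in zip(confidences, outcomes):
        c = min(max(c, eps), 1.0 - eps)
        total -= y * math.log(c) + (1 - y) * math.log(1.0 - c)
    return total / len(confidences)


# Hypothetical example: the model reports 0.9 confidence but is right 3 times out of 4.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 1, 0]
ece = expected_calibration_error(confs, hits)
brier = brier_score(confs, hits)
nll = negative_log_likelihood(confs, hits)
```

In this toy case all four predictions fall in one bin, so the ECE is simply the gap between stated confidence (0.9) and observed accuracy (0.75).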
Abstract
Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
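One plausible reading of action-wise Platt scaling is to fit a separate affine recalibration of the confidence logit for each action dimension. The sketch below follows that reading and should be taken as an assumption rather than the paper's exact formulation: the dimension names, the binary outcome labels, and the plain gradient-descent fit are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-6):
    """Inverse of the sigmoid, with clamping to keep the value finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_platt(confidences, outcomes, lr=0.1, steps=2000):
    """Fit (a, b) so that sigmoid(a * logit(conf) + b) matches the outcomes,
    by gradient descent on the mean negative log-likelihood."""
    a, b = 1.0, 0.0  # identity mapping as the starting point
    xs = [logit(c) for c in confidences]
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, outcomes):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_actionwise(conf_by_dim, outcomes_by_dim):
    """Fit one Platt scaler per action dimension (assumed meaning of 'action-wise')."""
    return {dim: fit_platt(conf_by_dim[dim], outcomes_by_dim[dim]) for dim in conf_by_dim}


def apply_platt(confidence, params):
    """Map a raw confidence to a recalibrated one."""
    a, b = params
    return sigmoid(a * logit(confidence) + b)


# Hypothetical held-out data: "x" is overconfident, "gripper" is already calibrated.
conf_by_dim = {"x": [0.9] * 10, "gripper": [0.8] * 10}
outcomes_by_dim = {"x": [1] * 6 + [0] * 4, "gripper": [1] * 8 + [0] * 2}
scalers = fit_actionwise(conf_by_dim, outcomes_by_dim)
calibrated_x = apply_platt(0.9, scalers["x"])
calibrated_gripper = apply_platt(0.8, scalers["gripper"])
```

Fitting per dimension lets an overconfident dimension be corrected without disturbing one that is already well calibrated, at the cost of needing enough held-out data for each dimension.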
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper is well-motivated. The problem of confidence calibration is highly relevant for deploying trustworthy robotic systems in high-stakes, real-world environments.
- The paper's technical novelty is limited. The proposed methods are largely simple applications of existing techniques to an existing VLA.
The claim of a systematic study is not well-supported by the experiments:
- All experiments are conducted on a single VLA, OpenVLA. This is not enough to claim a systematic study of calibration errors for VLAs. Diffusion-based VLAs might exhibit different calibration behavior.
- All experiments are conducted in the LIBERO simulation. For a study on this
The paper is clear, well-structured, and grounded in the calibration literature. The experimental design is sound, using multiple metrics (ECE, Brier, NLL) and analyzing calibration across tasks, time, and model precision. The results are consistent and interpretable. For example, better-performing models are also better calibrated, and early overconfidence decreases over time. The proposed fixes are simple but effective and practical for real-world systems.
- Methods are empirical adaptations rather than theoretical advances.
- All experiments are simulation-only; no real-robot validation. More generally, it is not clear what role robotics plays in this framework.
- Only one model family (OpenVLA) and one benchmark were tested.
- Missing comparisons to other post-hoc calibration baselines.
- Discussion of broader implications (safety, planning) could be deeper.
- The evaluation is entirely simulation-based and limited to OpenVLA on LIBERO tasks, w
- Originality: From my understanding of the literature, there are no previous works that studied the problem of calibration in VLA models. The authors focus on this problem, which has not been previously studied.
- Quality: Overall, the paper is decently presented with high-quality results for calibration scores in the OpenVLA setting. Looking at the code, it seems to be cleaned up well and is transparently released for reproducibility.
- Clarity: Overall, the paper is clear, but needs improveme
- The authors claim to have a comprehensive evaluation, but this paper focuses exclusively on OpenVLA. Over the past year or so, various other VLA models have been made open-source, so evaluating on other open-source VLA models (Octo, Pi-0, etc.) would significantly strengthen the claims in this work.
- As the authors acknowledge, all evaluation is performed in simulation, leaving out other potential sources of confidence miscalibration, such as sensor noise.
Taxonomy
Topics: Multimodal Machine Learning Applications
