Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction

Zhaoda Du; Michael Bowman; Qiaojie Zheng; Xiaoli Zhang

arXiv:2603.10061 · cs.RO · March 13, 2026

TL;DR

This paper systematically evaluates how well vision-language models calibrate their uncertainty in early action recognition for human-robot interaction, where predictions must be made from ambiguous, partial observations and overconfident predictions threaten safety and reliability.

Contribution

The paper introduces a temporal-prefix evaluation protocol and accompanying metrics for assessing the uncertainty calibration of vision-language models in action recognition.
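
To make the idea of a temporal-prefix calibration check concrete, the following is a minimal sketch and not the paper's implementation. This summary does not name the metrics used, so expected calibration error (ECE), a common calibration measure, serves here as a stand-in. The function names, the record layout (prefix ratio, confidence, correctness), and the default of 10 bins are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: sample-weighted mean gap between accuracy and mean confidence per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to an equal-width confidence bin over [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def ece_by_prefix(
    records: Iterable[Tuple[float, float, bool]],
    n_bins: int = 10,
) -> Dict[float, float]:
    """Group (prefix_ratio, confidence, is_correct) records by prefix ratio
    and compute ECE separately for each observed fraction of the action."""
    grouped: Dict[float, List[Tuple[float, bool]]] = {}
    for ratio, conf, ok in records:
        grouped.setdefault(ratio, []).append((conf, ok))
    return {
        ratio: expected_calibration_error(
            [c for c, _ in items], [ok for _, ok in items], n_bins
        )
        for ratio, items in sorted(grouped.items())
    }


# Hypothetical toy data: predictions made after seeing 25% and 100% of a clip.
records = [
    (0.25, 0.75, True),
    (0.25, 0.75, False),
    (1.0, 1.0, True),
    (1.0, 1.0, True),
]
per_prefix = ece_by_prefix(records)
```

Reporting the metric per prefix ratio, rather than pooled over all prefixes, is what would expose a model that is well calibrated on full clips but overconfident on early fragments.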

Findings

01. Identifies miscalibration patterns under partial observations.

02. Provides reliability evidence for confidence-based human-robot interaction.

03. Characterizes failure modes in early action prediction.

Abstract

Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning