Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction
Zhaoda Du, Michael Bowman, Qiaojie Zheng, Xiaoli Zhang

TL;DR
This paper systematically evaluates the uncertainty calibration of vision-language models in early action recognition for human-robot interaction, addressing safety and reliability in ambiguous, partial observations.
Contribution
It introduces a temporal-prefix evaluation protocol and metrics for assessing uncertainty calibration in vision-language models for action recognition.
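As a rough illustration of what calibration measured per temporal prefix could look like, the sketch below computes a standard binned Expected Calibration Error (ECE) separately for each observed fraction of a clip. This is not the paper's protocol or metric set, which this summary does not spell out. The record layout `(prefix_ratio, confidence, is_correct)`, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def ece_by_prefix(records, n_bins=10):
    """Group hypothetical (prefix_ratio, confidence, is_correct) records by
    prefix ratio and compute ECE within each group."""
    grouped = defaultdict(lambda: ([], []))
    for ratio, conf, ok in records:
        grouped[ratio][0].append(conf)
        grouped[ratio][1].append(ok)
    return {
        ratio: expected_calibration_error(confs, oks, n_bins)
        for ratio, (confs, oks) in sorted(grouped.items())
    }


# Toy, made-up predictions: the same confidences at 25% and 100% of the clip,
# but fewer correct answers when only the early prefix is visible.
toy_records = [
    (0.25, 0.95, False), (0.25, 0.95, True),
    (0.25, 0.85, False), (0.25, 0.85, False),
    (1.0, 0.95, True), (1.0, 0.95, True),
    (1.0, 0.85, True), (1.0, 0.85, False),
]
toy_result = ece_by_prefix(toy_records)
```

In this toy data the early prefix has the larger ECE, which is the kind of overconfidence under partial observation that the paper sets out to characterize. A real evaluation would need many more samples per prefix for the bin estimates to be meaningful.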
Findings
The evaluation identifies miscalibration patterns under partial observations.
It provides reliability evidence for confidence-based human-robot interaction.
It characterizes failure modes in early action prediction.
Abstract
Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
