UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Jiabing Yang; Yixiang Chen; Yuan Xu; Peiyan Li; Xiangnan Wu; Zichen Wen; Bowen Fang; Tao Yu; Zhengbo Zhang; Yingda Li; Kai Wang; Jing Liu; Nianfeng Liu; Yan Huang; Liang Wang

arXiv:2602.18020·cs.CV·February 23, 2026

Open Access · 3 Reviews

TL;DR

UAOR is a training-free, plug-and-play module that improves vision-language-action models by reinjecting observation information based on uncertainty, enhancing action confidence without extra data or training.

Contribution

Proposes UAOR, a novel uncertainty-aware observation reinjection method that enhances VLA models without additional data, training, or modules.

Findings

01 · Consistently improves diverse VLA models in simulation and real-world tasks

02 · Eliminates need for extra observation cues or modules

03 · Operates with minimal computational overhead

Abstract

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that the Feed-Forward Network (FFN) in language models can act as a "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free, and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information…
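The gating idea in the abstract (measure a layer's uncertainty over action tokens, and reinject observation information only when it is high) can be illustrated with a minimal sketch. The abstract is truncated, so everything below is an assumption rather than the authors' formulation: the per-layer action logits are taken as given, the entropy is normalized by its maximum, the reinjection is a simple softmax key-value lookup over observation features, and the names `threshold` and `mix` are hypothetical parameters introduced only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Shannon entropy of softmax(logits), scaled to [0, 1] by log(vocab size)."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(logits))


def reinject(hidden, obs_keys, obs_values, mix):
    """Hypothetical key-value lookup: the hidden state queries observation keys,
    and the retrieved value is blended into the hidden state with weight `mix`."""
    scores = [sum(h * k for h, k in zip(hidden, key)) for key in obs_keys]
    weights = softmax(scores)
    retrieved = [
        sum(w * value[i] for w, value in zip(weights, obs_values))
        for i in range(len(hidden))
    ]
    return [(1.0 - mix) * h + mix * r for h, r in zip(hidden, retrieved)]


def maybe_reinject(hidden, layer_logits, obs_keys, obs_values,
                   threshold=0.5, mix=0.3):
    """Reinject observation features only when the layer looks uncertain.
    `threshold` and `mix` are illustrative values, not taken from the paper."""
    if normalized_entropy(layer_logits) > threshold:
        return reinject(hidden, obs_keys, obs_values, mix)
    return hidden


if __name__ == "__main__":
    hidden = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[2.0, 0.0], [0.0, 2.0]]
    # Confident layer (peaked logits): hidden state is left unchanged.
    print(maybe_reinject(hidden, [10.0, 0.0, 0.0, 0.0], keys, values))
    # Uncertain layer (flat logits): observation features are blended in.
    print(maybe_reinject(hidden, [0.0, 0.0, 0.0, 0.0], keys, values))
```

Because the gate only adds an entropy computation and an occasional lookup, a scheme of this shape needs no training, which is consistent with the plug-and-play framing above. How the paper obtains per-layer action logits and what exactly is reinjected are not recoverable from the truncated abstract.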

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- The paper is fairly well-written and easy to understand.
- The problem studied is important.

Weaknesses

- The core premise of the paper is not well-supported. The authors claim that "Our key intuition is that after ingesting the observation, the model tends to progressively “forget” during forward inference" and back this up with Figure 1, where the **early** layers of the VLA experience a mild increase in action uncertainty. This is neither convincing nor well-explained. To me, Fig. 1 is actually quite reasonable: early in the computation, there's more uncertainty in the action distribution but a

Reviewer 02 · Rating 4 · Confidence 5

Strengths

S1. I think the problem addressed by this paper is very important, as overly deep LLM layers do tend to ignore certain visual information.
S2. The paper proposes a simple yet efficient method to alleviate this issue.

Weaknesses

W1. My main concern lies in whether using Action Token Entropy to measure visual uncertainty is reasonable.
W2. Pretrained VLA models typically generate actions only from the final layer, without utilizing or supervising intermediate features for action prediction. Therefore, the observed layer-wise “action” token entropy may result from the training paradigm itself rather than reflecting the actual dynamics of feature changes within the model. I suggest that the authors finetune the VLA model

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Demonstrates a strong understanding of the current limitations of VLA models.
2. Attempts to address the identified problems in a general and systematic way.
3. Proposes the concept of action entropy and applies it effectively in the forward process. The accompanying theoretical analysis is solid and convincing.
4. Provides a well-designed ablation study to examine the effects of observation injection.

Weaknesses

1. The explanation of why forgetting leads to uncertainty lacks a clear reasoning process.
2. Theorems 3.1–3.4 appear largely independent and not well integrated. It would be better to unify them into a holistic framework or justify that they represent complementary perspectives on the same problem.
3. Real-world experiments are only conducted using Open-VLA, neglecting other baseline models such as CogACT.
4. The relationship between α and γ should be jointly analyzed, as variations in one m

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)