ROSA: Harnessing Robot States for Vision-Language and Action Alignment
Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun

TL;DR
ROSA introduces a novel training paradigm that leverages robot state estimation to better align vision-language models with robotic actions, improving efficiency and generalization in robotic control tasks.
Contribution
ROSA is the first approach to incorporate robot state estimation for aligning vision-language and action spaces, addressing data inefficiency and spatial-temporal gaps in existing methods.
Findings
Enhanced performance in simulated environments.
Improved generalization in real-world tasks.
Effective in low-data regimes.
Abstract
Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear and focused reframing of the VLA alignment problem via explicit robot state estimation as an auxiliary task.
2. Method is well-grounded: the spatial-temporal gap motivation is compelling and supported by analyses (e.g., linear probing).
3. The writing is good and easy to follow.
1. The paper notes that too much state data can degrade action prediction, but the mitigation strategy is limited to a fixed mixing ratio. Deeper analysis of curricula, sampling strategies, or domain-adaptive weighting could clarify best practices.
2. The setup assumes fixed camera-robot extrinsics. Mobile or cross-camera settings remain open.
3. Evaluation on LIBERO and CALVIN, two commonly used VLA benchmarks, is missing.
4. The state estimation as an auxili
1. The proposed ROSA demonstrates data efficiency. With limited training data, the model can leverage estimated robot proprioceptive states to better understand scene information.
2. ROSA shows significant improvements over baseline models across multiple RLBench simulation environments and real-world tasks, highlighting its enhanced 3D spatial perception capability focused on robot state.
1. The paper only introduces a method for enhancing robot proprioceptive state prediction without delving into how this enhancement affects model representation and reasoning capabilities. The proposed automated data collection paradigm is relatively simplistic, and the overall contribution and novelty appear limited. The authors could further investigate the impact on model representation abilities.
2. While experiments are conducted in both simulated and real-world environments, the simulation
- The core idea of decomposing the VLM-to-VLA alignment problem into two sub-tasks (predicting the current state and predicting the future action) is a novel contribution. This reframing provides a clear mechanism to directly tackle the identified spatial gap by grounding the model in its own physical embodiment.
- The paper is well-written. The motivation is articulated with great clarity, and the core concept of the spatio-temporal gap is well illustrated.
- The authors validate the general appl
- While the paper strongly justifies how the state estimation task bridges the spatial gap, the argument for bridging the temporal gap is less direct. The auxiliary task focuses on estimating the current state ("What is the current state of the robot?"). While a better understanding of the present undoubtedly aids in predicting the future, the mechanism does not explicitly train the model on temporal dynamics or future forecasting beyond what the standard expert demonstrations already provide. -
* The method effectively addresses the spatio-temporal misalignment between VLMs and VLAs through a novel integration of robot state estimation.
* The auxiliary task design enhances the model’s spatial reasoning and action prediction capabilities without requiring extensive human annotation.
* Extensive experiments across simulation and real-world environments validate the robustness and generalization ability of ROSA, particularly in low-data regimes.
While ROSA demonstrates compelling advantages in low-data regimes, its scalability and sustained effectiveness in large-scale data scenarios remain unclear. As the volume of expert demonstrations increases, the relative contribution of the robot state estimation task may diminish. The core weakness is not that ROSA performs worse, but that its unique value proposition appears to weaken. One would expect a truly robust method to maintain a more consistent performance gap, demonstrating that the
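
The reviews describe two sub-tasks, estimating the current robot state and predicting the future action, and note that the two kinds of data are combined at a fixed mixing ratio. The following is a minimal sketch of that idea only, not the authors' implementation. The `Sample` type, the helper names, the use of the next state as a stand-in for the action target, and the ratio value are all assumptions made for illustration; only the state-estimation prompt text is taken from a review.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence

# Prompt wording quoted in one of the reviews above.
STATE_PROMPT = "What is the current state of the robot?"


@dataclass
class Sample:
    frame: int                # index of the observation frame (stand-in for an image)
    prompt: str               # text query given to the model
    target: Sequence[float]   # numeric target the model should output


def action_samples(states: List[Sequence[float]], instruction: str) -> List[Sample]:
    """Expert-demonstration samples: frame t paired with a future target.

    Assumption: the next state stands in for the action; the source does not
    specify the action representation.
    """
    return [Sample(t, instruction, states[t + 1]) for t in range(len(states) - 1)]


def state_samples(states: List[Sequence[float]]) -> List[Sample]:
    """State-estimation samples: frame t paired with the state at the same t."""
    return [Sample(t, STATE_PROMPT, states[t]) for t in range(len(states))]


def mix(actions: List[Sample], states: List[Sample],
        state_ratio: float, rng: random.Random) -> List[Sample]:
    """Combine both sample types so state samples make up about `state_ratio`
    of the result (hypothetical fixed-ratio scheme)."""
    if not 0.0 <= state_ratio < 1.0:
        raise ValueError("state_ratio must be in [0, 1)")
    n_state = round(len(actions) * state_ratio / (1.0 - state_ratio))
    n_state = min(n_state, len(states))  # cannot draw more than are available
    mixed = list(actions) + rng.sample(list(states), n_state)
    rng.shuffle(mixed)
    return mixed


# Toy trajectory of 3-D end-effector positions (made-up values).
trajectory = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.1, 0.0],
              [0.3, 0.1, 0.1], [0.3, 0.2, 0.1]]
acts = action_samples(trajectory, "pick up the block")
sts = state_samples(trajectory)
batch = mix(acts, sts, state_ratio=0.2, rng=random.Random(0))
```

In this sketch the state samples need no human annotation because the targets come straight from the recorded trajectory, which matches the reviewers' point about reduced labeling effort. How the ratio should be chosen, or whether it should vary over training, is exactly what the first review leaves open.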
Taxonomy
Topics: Robot Manipulation and Learning · Robotics and Automated Systems · Multimodal Machine Learning Applications
