Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress
Yuelin Zhang, Sijie Cheng, Chen Li, Zongzhao Li, Yuxin Huang, Yang Liu, Wenbing Huang

TL;DR
This paper introduces R^2VLM, a recurrent reasoning vision-language model that efficiently estimates progress in long-horizon embodied tasks by iteratively processing video snippets and maintaining a global context through a Chain of Thought.
Contribution
The paper presents a novel recurrent reasoning framework for vision-language models that effectively handles long video trajectories and complex temporal dependencies in task progress estimation.
Findings
Achieves state-of-the-art performance in long-horizon progress estimation.
Demonstrates strong generalization across various downstream applications.
Efficiently processes long videos without high computational costs.
Abstract
Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing methods based on Vision-Language Models (VLMs) primarily leverage video understanding capabilities while neglecting the models' complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (R^2VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of…
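The recurrent loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `ChainOfThought`, `split_into_snippets`, `estimate_progress`, and `toy_step` are hypothetical, the VLM call is replaced by a toy stand-in that matches frame labels against step names, and reading progress as the fraction of completed steps is an assumption made only for illustration. The point it shows is structural: each iteration sees one short snippet plus the carried-over CoT, never the full video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class ChainOfThought:
    """Hypothetical global context: the task's key steps and their status."""
    steps: List[str]
    done: Dict[str, bool] = field(default_factory=dict)

    def progress(self) -> float:
        # Assumption: progress is the fraction of key steps marked complete.
        if not self.steps:
            return 0.0
        return sum(self.done.get(s, False) for s in self.steps) / len(self.steps)


def split_into_snippets(frames: Sequence[str], snippet_len: int) -> List[Sequence[str]]:
    """Cut a long trajectory into consecutive local snippets."""
    return [frames[i:i + snippet_len] for i in range(0, len(frames), snippet_len)]


StepFn = Callable[[Sequence[str], ChainOfThought], ChainOfThought]


def estimate_progress(frames: Sequence[str], cot: ChainOfThought,
                      step_fn: StepFn, snippet_len: int = 4):
    """Recurrently update the CoT, one snippet at a time."""
    history = []
    for snippet in split_into_snippets(frames, snippet_len):
        # Only the current snippet and the previous CoT are passed forward.
        cot = step_fn(snippet, cot)
        history.append(cot.progress())
    return cot, history


def toy_step(snippet: Sequence[str], cot: ChainOfThought) -> ChainOfThought:
    """Stand-in for a VLM call: a frame labelled with a step name completes it."""
    done = dict(cot.done)
    for frame in snippet:
        if frame in cot.steps:
            done[frame] = True
    return ChainOfThought(cot.steps, done)


if __name__ == "__main__":
    frames = ["idle", "open drawer", "idle", "idle", "pick cup",
              "idle", "place cup", "idle"]
    start = ChainOfThought(["open drawer", "pick cup", "place cup"])
    final_cot, history = estimate_progress(frames, start, toy_step, snippet_len=3)
    print(history)
```

In this sketch the per-iteration input size depends on `snippet_len` and the size of the CoT rather than on the trajectory length, which is the property the recurrent design relies on.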
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI
