Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress

Yuelin Zhang; Sijie Cheng; Chen Li; Zongzhao Li; Yuxin Huang; Yang Liu; Wenbing Huang

arXiv:2603.17312·cs.CV·March 19, 2026

TL;DR

This paper introduces $R^{2}$VLM, a recurrent reasoning vision-language model that efficiently estimates progress in long-horizon embodied tasks by iteratively processing video snippets and maintaining a global context through a Chain of Thought.

Contribution

The paper presents a novel recurrent reasoning framework for vision-language models that effectively handles long video trajectories and complex temporal dependencies in task progress estimation.

Findings

01. Achieves state-of-the-art performance in long-horizon progress estimation.

02. Demonstrates strong generalization across various downstream applications.

03. Efficiently processes long videos without high computational costs.

Abstract

Accurately estimating task progress is critical for embodied agents to plan and execute long-horizon, multi-step tasks. Despite promising advances, existing Vision-Language Model (VLM)-based methods primarily leverage the models' video understanding capabilities while neglecting their complex reasoning potential. Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model ($R^{2}$VLM). Our model features a recurrent reasoning framework that processes local video snippets iteratively, maintaining a global context through an evolving Chain of Thought (CoT). This CoT explicitly records task decomposition, key steps, and their completion status, enabling the model to reason about complex temporal dependencies. This design avoids the high cost of…
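The recurrent loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `CoTState`, `toy_reasoner`, `chunk`, and `estimate_progress` are hypothetical names, frames are stand-in strings rather than images, and the VLM call is replaced by a trivial string match. The only point it shows is that each iteration sees one local snippet plus the carried-over record of steps and their completion status, never the full trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class CoTState:
    """Hypothetical running record: task steps and whether each is complete."""
    steps: List[str]
    done: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with every step marked incomplete unless a status list is given.
        if not self.done:
            self.done = [False] * len(self.steps)

    def progress(self) -> float:
        # Fraction of recorded steps marked complete (an assumed progress measure).
        return sum(self.done) / len(self.steps) if self.steps else 0.0


# A reasoner maps (current snippet, previous state) to an updated state.
Reasoner = Callable[[Sequence[str], CoTState], CoTState]


def chunk(frames: Sequence[str], size: int) -> List[Sequence[str]]:
    """Split a trajectory into consecutive snippets of at most `size` frames."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def estimate_progress(frames: Sequence[str], state: CoTState,
                      reasoner: Reasoner, snippet_len: int) -> List[float]:
    """Process snippets one at a time, carrying the state forward between them."""
    history = []
    for snippet in chunk(frames, snippet_len):
        state = reasoner(snippet, state)  # only the local snippet and the state
        history.append(state.progress())
    return history


def toy_reasoner(snippet: Sequence[str], state: CoTState) -> CoTState:
    """Stand-in for a VLM call: a step is complete once its name appears."""
    done = [d or (s in snippet) for s, d in zip(state.steps, state.done)]
    return CoTState(steps=state.steps, done=done)


# Toy trajectory: strings stand in for video frames.
frames = ["idle", "open drawer", "idle", "pick cup", "idle", "place cup"]
state = CoTState(steps=["open drawer", "pick cup", "place cup", "close drawer"])
history = estimate_progress(frames, state, toy_reasoner, snippet_len=2)
print(history)  # [0.25, 0.5, 0.75]
```

In this sketch the per-iteration input size depends on `snippet_len` and the size of the state, not on the total trajectory length, which is the intuition behind the efficiency claim; how the actual model represents and updates its CoT is not specified here.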

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI