See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

Tingjun Dai; Mingfei Han; Tingwen Du; Zhiheng Liu; Zhihui Li; Salman Khan; Jun Yu; Xiaojun Chang

arXiv:2603.09292 · cs.RO · March 11, 2026

Open Access

TL;DR

This paper introduces SPR, a progress-aware framework for robotic manipulation that uses a continuous cycle of seeing, planning, and rewinding to improve robustness and error recovery without extra training data.

Contribution

The paper presents a novel progress-aware vision-language-action model that grounds instructions into spatial subgoals and incorporates a cycle of monitoring, planning, and rewinding for robustness.

Findings

01. SPR outperforms MolmoAct by 5% on the LIBERO benchmark.

02. SPR achieves state-of-the-art robustness on LIBERO-Plus with unseen instructions.

03. SPR demonstrates superior out-of-distribution robustness.

Abstract

Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate…
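
The See, Plan, Rewind cycle described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 2D point agent in place of a robot and a vision-language-action model, and every name in it (`ProgressTracker`, `plan_step`, `run_episode`, `tol`, `patience`) is hypothetical. It only shows the control flow: check progress against an ordered list of waypoints, step toward the next one, and return to the last recoverable state when progress stalls.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class ProgressTracker:
    """Hypothetical progress monitor over an ordered list of 2D waypoints."""
    waypoints: List[Point]
    tol: float = 0.05            # distance at which a milestone counts as reached
    patience: int = 3            # non-improving steps tolerated before a rewind
    reached: int = 0             # number of milestones completed so far
    stalled: int = 0             # consecutive steps without improvement
    best_dist: float = math.inf  # closest distance to the current milestone

    def reset_stall(self) -> None:
        self.stalled = 0
        self.best_dist = math.inf

    def see(self, pos: Point) -> Optional[Point]:
        """Mark reached milestones; return the next waypoint, or None when done."""
        while (self.reached < len(self.waypoints)
               and math.dist(pos, self.waypoints[self.reached]) <= self.tol):
            self.reached += 1
            self.reset_stall()
        if self.reached == len(self.waypoints):
            return None
        return self.waypoints[self.reached]

    def is_stalled(self, pos: Point, target: Point) -> bool:
        """True once the distance to the target has not improved for `patience` steps."""
        d = math.dist(pos, target)
        if d < self.best_dist - 1e-9:
            self.best_dist = d
            self.stalled = 0
        else:
            self.stalled += 1
        return self.stalled >= self.patience


def plan_step(pos: Point, target: Point, step: float = 0.1) -> Point:
    """Move at most `step` along the straight line toward the target waypoint."""
    d = math.dist(pos, target)
    if d <= step:
        return target
    s = step / d
    return (pos[0] + s * (target[0] - pos[0]), pos[1] + s * (target[1] - pos[1]))


def run_episode(start: Point, waypoints: List[Point],
                disturb: Optional[Callable[[int, Point], Point]] = None,
                max_steps: int = 200) -> Tuple[bool, int]:
    """Run the see/plan/rewind loop; return (success, number_of_rewinds)."""
    tracker = ProgressTracker(waypoints)
    pos, checkpoint, rewinds = start, start, 0
    for t in range(max_steps):
        before = tracker.reached
        target = tracker.see(pos)              # See: current state and next milestone
        if target is None:
            return True, rewinds
        if tracker.reached > before:
            checkpoint = pos                   # newest recoverable state
        if tracker.is_stalled(pos, target):
            pos = checkpoint                   # Rewind (a real robot would move back, not jump)
            rewinds += 1
            tracker.reset_stall()
            continue
        pos = plan_step(pos, target)           # Plan: step toward the next 2D waypoint
        if disturb is not None:
            pos = disturb(t, pos)              # assumed external perturbation
    return False, rewinds


if __name__ == "__main__":
    goal_path = [(0.3, 0.0), (0.3, 0.3)]
    # Assumed disturbance: the agent is pinned off-path during steps 1 to 5.
    pinned = lambda t, p: (0.1, 0.2) if 1 <= t <= 5 else p
    print(run_episode((0.0, 0.0), goal_path))
    print(run_episode((0.0, 0.0), goal_path, disturb=pinned))
```

In this sketch the rewind needs no extra model because the same waypoint sequence serves both as the plan and as the progress signal. That design choice mirrors the abstract's claim only loosely, and the stall criterion used here is an assumption.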

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics