ROVER: Recursive Reasoning Over Videos with Vision-Language Models for Embodied Tasks
Philip Schroeder, Ondrej Biza, Thomas Weng, Hongyin Luo, James Glass

TL;DR
ROVER introduces a recursive framework that decomposes long videos into shorter segments for focused reasoning, improving accuracy and efficiency in embodied task understanding with vision-language models.
Contribution
ROVER is the first method to recursively segment videos for improved reasoning in embodied tasks, enabling better focus and scalability in vision-language models.
Findings
ROVER outperforms baselines on video reasoning tasks.
It reduces hallucinations at unexpected moments in a trajectory.
It scales linearly with video length, improving efficiency.
Abstract
Vision-language models (VLMs) have exhibited impressive capabilities across diverse image understanding tasks, but still struggle in settings that require reasoning over extended sequences of camera frames from a video. This limits their utility in embodied settings, which require reasoning over long frame sequences from a continuous stream of visual input at each moment of a task attempt. To address this limitation, we propose ROVER (Reasoning Over VidEo Recursively), a framework that enables the model to recursively decompose long-horizon video trajectories into segments corresponding to shorter subtasks within the trajectory. In doing so, ROVER facilitates more focused and accurate reasoning over temporally localized frame sequences without losing global context. We evaluate ROVER, implemented using an in-context learning approach, on diverse OpenX Embodiment videos and on a new…
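The recursive decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `reason_recursively`, `toy_split`, and `toy_describe` are hypothetical, frames are represented as plain strings, and the two callbacks stand in for the vision-language model calls that a real system would make through in-context prompting.

```python
from typing import Callable, List, Sequence, Tuple

Frame = str                 # stand-in for an image; a real system would pass pixel data
Segment = Tuple[int, int]   # half-open [start, end) indices into the frame list


def reason_recursively(
    frames: Sequence[Frame],
    task: str,
    split: Callable[[Sequence[Frame], str], List[Tuple[str, Segment]]],
    describe: Callable[[Sequence[Frame], str], str],
    depth: int = 0,
    max_depth: int = 3,
) -> List[str]:
    """Return one indented line of reasoning per (sub)task, depth-first."""
    lines = ["  " * depth + describe(frames, task)]
    if depth >= max_depth:
        return lines
    for subtask, (start, end) in split(frames, task):
        # Skip empty or non-shrinking segments so the recursion always terminates.
        if end <= start or end - start >= len(frames):
            continue
        lines += reason_recursively(
            frames[start:end], subtask, split, describe, depth + 1, max_depth
        )
    return lines


def toy_split(frames: Sequence[Frame], task: str) -> List[Tuple[str, Segment]]:
    """Hypothetical stand-in for a model call: cut wherever the frame label changes."""
    segments: List[Tuple[str, Segment]] = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], (start, i)))
            start = i
    # A single segment means there is nothing left to decompose.
    return segments if len(segments) > 1 else []


def toy_describe(frames: Sequence[Frame], task: str) -> str:
    """Hypothetical stand-in for focused reasoning over one short segment."""
    return f"{task}: {len(frames)} frames"


if __name__ == "__main__":
    video = ["reach"] * 3 + ["grasp"] * 2 + ["lift"] * 4
    for line in reason_recursively(video, "pick up cup", toy_split, toy_describe):
        print(line)
```

In this sketch each call reasons only over the frames of its own segment, while the task label passed down from the parent carries the wider context. How ROVER itself chooses segment boundaries and preserves global context is not specified in the text above.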
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
