TL;DR
CollabVR introduces a closed-loop framework coupling vision-language and video generation models for improved goal-directed visual reasoning, addressing long-horizon drift and simulation errors.
Contribution
The paper proposes step-level collaboration between a Vision-Language Model (VLM) and a Video Generation Model (VGM), which improves reasoning accuracy by integrating planning, inspection, and failure repair into video generation.
Findings
CollabVR significantly improves performance on the Gen-ViRe and VBVR-Bench benchmarks.
It outperforms both single-inference and prior test-time scaling baselines.
It yields further gains when combined with reasoning-fine-tuned VGMs.
Abstract
Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the…
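The step-level loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the callables plan_next, generate, and inspect, the Verdict and LoopState types, the retry budget, and the way feedback is appended to the action text are all hypothetical stand-ins chosen for this sketch. The abstract is truncated before it says what the VLM folds back into the loop, so the repair step here is an assumption based on the "failure repair" mentioned under Contribution.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Result of the VLM inspecting one generated clip (hypothetical type)."""
    ok: bool
    feedback: str = ""


@dataclass
class LoopState:
    """Running state of the loop (hypothetical type)."""
    goal: str
    clips: List[str] = field(default_factory=list)  # clips accepted so far
    notes: List[str] = field(default_factory=list)  # feedback from failed inspections


def collab_loop(
    goal: str,
    plan_next: Callable[[LoopState], Optional[str]],     # VLM: next action, or None when done
    generate: Callable[[List[str], str], str],           # VGM: clip conditioned on history + action
    inspect: Callable[[LoopState, str, str], Verdict],   # VLM: check the clip against the action
    max_steps: int = 8,
    max_retries: int = 2,
) -> LoopState:
    state = LoopState(goal)
    for _ in range(max_steps):
        action = plan_next(state)          # plan only the immediate next action
        if action is None:                 # planner signals the goal is reached
            break
        for _attempt in range(max_retries + 1):
            clip = generate(state.clips, action)
            verdict = inspect(state, action, clip)
            if verdict.ok:
                state.clips.append(clip)   # accept the clip and move to the next step
                break
            # Assumed repair strategy: record the feedback and regenerate
            # the same step with the feedback attached to the action.
            state.notes.append(verdict.feedback)
            action = f"{action} (fix: {verdict.feedback})"
        else:
            break                          # retries exhausted; stop early
    return state
```

In this sketch a rejected clip never enters the history, so later steps are conditioned only on clips the inspector accepted. That is one way to keep a mid-clip error from compounding; whether CollabVR handles it the same way is not stated in the text above.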
