Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

Tuyen Tran; Thao Minh Le; Quang-Hung Le; Truyen Tran

arXiv:2508.07330 · cs.CV · August 19, 2025

TL;DR

Planner-Refiner is a framework that iteratively refines space-time visual representations under language guidance to improve video-language alignment, especially for complex prompts. It is evaluated on both new and existing benchmarks.

Contribution

The paper introduces Planner-Refiner, a framework that decomposes complex language prompts into chains of short sentences and iteratively refines visual features against each one for better alignment in videos.

Findings

01. Outperforms state-of-the-art methods on Referring Video Object Segmentation.

02. Achieves superior results on Temporal Grounding with complex prompts.

03. Introduces the MeViS-X benchmark for evaluating long queries.

Abstract

Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We…
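
The refinement loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the Planner's output is already available as a list of (noun-phrase, verb-phrase) embedding pairs, that language guidance is injected by adding the phrase vector to the attention queries, and that each pass uses a residual update. The function names `guided_attention`, `refine_step`, and `planner_refiner` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def guided_attention(tokens, guide):
    """Self-attention over `tokens` with queries shifted by a language vector.

    Assumption: guidance is injected by adding `guide` to each query, and the
    attended mixture is added back to the token (residual update).
    """
    dim = len(guide)
    refined = []
    for token in tokens:
        query = [t + g for t, g in zip(token, guide)]
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in tokens])
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        refined.append([t + m for t, m in zip(token, mixed)])
    return refined


def refine_step(video, noun_vec, verb_vec):
    """One refinement step: attention across space, then across time.

    `video[t][n]` is the feature vector of visual token n in frame t.
    """
    # Space: tokens within each frame attend to each other, guided by the noun phrase.
    spatial = [guided_attention(frame, noun_vec) for frame in video]
    # Time: each token position attends across frames, guided by the verb phrase.
    num_frames, num_tokens = len(spatial), len(spatial[0])
    tracks = [
        guided_attention([spatial[t][n] for t in range(num_frames)], verb_vec)
        for n in range(num_tokens)
    ]
    return [[tracks[n][t] for n in range(num_tokens)] for t in range(num_frames)]


def planner_refiner(video, plan):
    """Chain refinement steps recurrently, carrying the refined tokens forward.

    `plan` stands in for the Planner's output: a list of hypothetical
    (noun_vec, verb_vec) embedding pairs, one per short sentence.
    """
    for noun_vec, verb_vec in plan:
        video = refine_step(video, noun_vec, verb_vec)
    return video
```

Calling `planner_refiner(video, plan)` with a three-frame, two-token video and a two-sentence plan returns tokens of the same shape, which in the framework would then feed a task-specific head. Splitting attention into a spatial pass and a temporal pass keeps each pass over N or T tokens, rather than all N×T tokens at once.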

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques