Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos
Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran

TL;DR
Planner-Refiner is a framework that iteratively refines space-time visual representations under language guidance to improve video-language alignment, especially for complex prompts. It is evaluated on both new and existing benchmarks.
Contribution
It introduces a Planner-Refiner framework that decomposes complex language prompts and refines visual features iteratively for better alignment in videos.
Findings
Outperforms state-of-the-art on Referring Video Object Segmentation.
Achieves superior results on Temporal Grounding with complex prompts.
Introduces the MeViS-X benchmark for long query evaluation.
Abstract
Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We…
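The recurrent refinement loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the Planner has already produced a chain of (noun-phrase, verb-phrase) embedding pairs. It also assumes that the noun phrase guides spatial attention and the verb phrase guides temporal attention, and that guidance is an additive bias on the attention queries. All function names (`guided_self_attention`, `refine_step`, `planner_refiner`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def guided_self_attention(tokens, guide):
    # tokens: (..., N, D); guide: (D,) language embedding.
    # Assumption: language guidance is an additive bias on the queries.
    q = tokens + guide
    scores = q @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    return softmax(scores, axis=-1) @ tokens


def refine_step(tokens, noun_vec, verb_vec):
    # tokens: (T, S, D) = frames x spatial tokens x channels.
    # Space first: attend among the S tokens inside each frame.
    # Assumption: the noun phrase guides the spatial pass.
    tokens = tokens + guided_self_attention(tokens, noun_vec)
    # Then time: attend among the T frames at each spatial location.
    # Assumption: the verb phrase guides the temporal pass.
    per_location = np.swapaxes(tokens, 0, 1)  # (S, T, D)
    per_location = per_location + guided_self_attention(per_location, verb_vec)
    return np.swapaxes(per_location, 0, 1)  # back to (T, S, D)


def planner_refiner(tokens, plan):
    # plan: list of (noun_vec, verb_vec) pairs, one per short sentence.
    # The refined tokens are carried from one step to the next.
    for noun_vec, verb_vec in plan:
        tokens = refine_step(tokens, noun_vec, verb_vec)
    return tokens


rng = np.random.default_rng(0)
video_tokens = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens each, dim 8
plan = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(3)]
refined = planner_refiner(video_tokens, plan)
```

In this sketch the output keeps the input shape `(T, S, D)`, so it could be passed to a task-specific head such as a segmentation or temporal grounding decoder. Those heads are outside the scope of the example.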
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
