Factorized Learning for Temporally Grounded Video-Language Models
Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng

TL;DR
This paper introduces a factorized learning framework for video-language models that decouples temporal grounding and textual response tasks, improving event-level perception accuracy through a novel paradigm and optimization algorithm.
Contribution
It proposes D$^2$VLM, a decoupled framework with evidence tokens and a new FPO algorithm for explicit temporal grounding, addressing limitations of coupled task handling.
Findings
Enhanced temporal grounding accuracy in experiments.
Improved textual response reliability.
Effective learning with a synthetic dataset.
Abstract
Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
