Factorized Learning for Temporally Grounded Video-Language Models

Wenzheng Zeng; Difei Gao; Mike Zheng Shou; Hwee Tou Ng

arXiv:2512.24097·cs.CV·January 1, 2026

Open Access · 1 Model · 1 Dataset

TL;DR

This paper introduces a factorized learning framework for video-language models that decouples temporal grounding and textual response tasks, improving event-level perception accuracy through a novel paradigm and optimization algorithm.

Contribution

It proposes D$^2$VLM, a decoupled framework with evidence tokens and a new FPO algorithm for explicit temporal grounding, addressing limitations of coupled task handling.

Findings

01 Enhanced temporal grounding accuracy in experiments.

02 Improved textual response reliability.

03 Effective learning with a synthetic dataset.

Abstract

Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture…

Code & Models

Models

wenzhengzeng/D2VLM-Models
model

Datasets

wenzhengzeng/D2VLM-Dataset
dataset · 9 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition