Temporal Preference Optimization for Long-Form Video Understanding

Rui Li; Xiaohan Wang; Yuhui Zhang; Orr Zohar; Zeyu Wang; Serena Yeung-Levy

arXiv:2501.13919 · cs.CV · September 3, 2025

TL;DR

This paper introduces Temporal Preference Optimization (TPO), a post-training framework that improves the temporal grounding ability of large multimodal video models by leveraging preference learning on curated datasets, without extensive manual annotations.

Contribution

The paper presents TPO, a post-training framework that improves long-form video understanding by optimizing models on preference datasets at two temporal granularities (localized and comprehensive), reducing reliance on manual annotation.

Findings

01. TPO significantly improves temporal grounding accuracy.

02. LLaVA-Video-TPO outperforms existing models on the Video-MME benchmark.

03. TPO is effective across multiple state-of-the-art video-LMMs.

Abstract

Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on…
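The preference-learning step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the exact objective, so this assumes a DPO-style loss on one preference pair; the function names, the `beta` value, and the use of summed response log-probabilities are illustrative assumptions, not the authors' implementation. Under this assumption, pairs drawn from either granularity (localized or comprehensive) would be scored the same way.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss for a single preference pair (assumed objective).

    Each argument is the summed log-probability of a full response:
    "chosen" is the well-grounded answer, "rejected" the less accurate one.
    "policy" is the model being trained, "ref" a frozen reference copy.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one (chosen, rejected) pair.
    print(preference_loss(-4.0, -9.0, -5.0, -7.0))
```

The loss falls below log 2 when the policy prefers the well-grounded response more strongly than the reference model does, and rises above it in the opposite case.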

Code & Models

Datasets

ruili0/LongVA-TPO-10k (dataset, 609 downloads)

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Data Compression Techniques