Temporal Preference Optimization for Long-Form Video Understanding
Rui Li, Xiaohan Wang, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy

TL;DR
This paper introduces Temporal Preference Optimization (TPO), a post-training framework that improves the temporal grounding ability of large multimodal video models by leveraging preference learning on curated datasets, without extensive manual annotations.
Contribution
The paper presents TPO, a post-training framework that improves long-form video understanding by optimizing models on preference datasets at multiple temporal granularities, which reduces the need for manual annotation.
Findings
TPO significantly improves temporal grounding accuracy.
LLaVA-Video-TPO outperforms existing models on the Video-MME benchmark.
TPO is effective across multiple state-of-the-art video-LMMs.
Abstract
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on…
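The preference-learning idea in the abstract can be illustrated with a small sketch. It assumes a DPO-style pairwise objective, which the text above does not specify; the names PreferencePair, dpo_loss, and beta are hypothetical, and the log-probabilities in the usage lines are illustrative inputs rather than values from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One preference example (hypothetical schema, not from the paper)."""
    question: str
    chosen: str       # well-grounded temporal response
    rejected: str     # less accurate temporal response
    granularity: str  # "localized" or "comprehensive"


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to a frozen reference model.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


pair = PreferencePair(
    question="What happens right after the door opens?",
    chosen="A person walks in carrying a box.",
    rejected="The scene cuts to a beach.",
    granularity="localized",
)

# Illustrative sequence log-probabilities for the two responses.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy equals reference
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)  # policy favors the chosen response
print(pair.granularity, round(neutral, 4), round(improved, 4))
```

In this sketch the loss falls as the policy assigns relatively more probability to the well-grounded response than the reference model does, which is the behavior a pairwise preference objective is meant to reward.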
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Data Compression Techniques
