SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

Jisheng Dang; Yizhou Zhang; Hao Ye; Teng Wang; Siming Chen; Huicheng Zheng; Yulan Guo; Jianhuang Lai; Bin Hu

arXiv:2506.00835·cs.AI·March 24, 2026

Open Access

TL;DR

This paper introduces SynPO, a preference optimization method for fine-grained video captioning that keeps negative preferences from dominating training, preserves the model's language capability, and improves training efficiency, outperforming DPO variants.

Contribution

We propose SynPO, a novel optimization approach that enhances video captioning models by preventing negative preferences from dominating the optimization and by preserving language capability, while also improving training efficiency.

Findings

01. SynPO outperforms DPO variants on video captioning benchmarks.

02. SynPO achieves 20% higher training efficiency.

03. SynPO generalizes well across NLP tasks and models.

Abstract

Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detailed information. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO). First, we propose a pipeline for constructing preference pairs that leverages the intrinsic properties of VLMs along with partial assistance from large language models, achieving an optimal balance between cost and data quality. Second, we propose Synergistic Preference Optimization (SynPO), a novel optimization method offering significant advantages over DPO and its variants. SynPO prevents negative preferences from dominating the optimization, explicitly preserves…
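
For context, the DPO baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the standard DPO loss for a single preference pair, not SynPO itself, whose objective is not given on this page. The function names, argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) caption pair.

    Each argument is the summed log-probability of a full caption under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss depends only on the margin between the two rewards.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Policy identical to the reference: zero margin, loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Only the rejected caption's log-probability is pushed down; the chosen
# caption is no more likely than before, yet the loss still decreases.
pushed_down = dpo_loss(-10.0, -20.0, -10.0, -12.0)
```

Because the loss depends only on the reward margin, it can be reduced by lowering the likelihood of the rejected caption without raising that of the chosen one. This is one plausible reading of the abstract's concern about negative preferences dominating the optimization. The sketch also shows that DPO needs log-probabilities from a separate reference model for every pair.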

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media