AVC-DPO: Aligned Video Captioning via Direct Preference Optimization
Jiyang Tang, Hengyi Li, Yifan Du, Wayne Xin Zhao

TL;DR
AVC-DPO is a post-training framework that aligns video captioning models with human preferences by emphasizing temporal and spatial details, significantly improving detailed captioning performance.
Contribution
The paper introduces AVC-DPO, a novel preference optimization method that enhances video captioning models' focus on human-relevant details through prompt-based training.
Findings
Achieved first place in the Video Detailed Captioning Challenge.
Significantly improved caption alignment with human preferences.
Demonstrated superior performance on the VDC benchmark.
Abstract
Although video multimodal large language models (video MLLMs) have achieved substantial progress in video captioning tasks, it remains challenging to adjust the focal emphasis of video captions according to human preferences. To address this limitation, we propose Aligned Video Captioning via Direct Preference Optimization (AVC-DPO), a post-training framework designed to enhance captioning capabilities in video MLLMs through preference alignment. Our approach designs enhanced prompts that specifically target temporal dynamics and spatial information, two key factors that humans care about when watching a video, thereby incorporating human-centric preferences. AVC-DPO leverages the same foundation model's caption generation responses under varied prompt conditions to conduct preference-aware training and caption alignment. Using this framework, we have achieved exceptional performance in…
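The abstract describes building preference data from one model's captions under different prompt conditions and then training on it. A minimal sketch of that idea follows, using the standard DPO objective for a single preference pair. It is not the authors' implementation. The names PreferencePair, build_pair, and dpo_loss, the default beta, and in particular the choice to treat the enhanced-prompt caption as "chosen" and the plain-prompt caption as "rejected" are assumptions made for illustration; the abstract does not state how pairs are ranked.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One training example: a prompt with a preferred and a dispreferred caption."""
    prompt: str
    chosen: str
    rejected: str


def build_pair(prompt: str, enhanced_caption: str, plain_caption: str) -> PreferencePair:
    # ASSUMPTION (not stated in the abstract): the caption generated under the
    # enhanced temporal/spatial prompt is treated as the preferred one.
    return PreferencePair(prompt=prompt, chosen=enhanced_caption, rejected=plain_caption)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, chosen only for this sketch
) -> float:
    """Standard DPO loss for one pair, given summed sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    pair = build_pair(
        prompt="Describe the video in detail.",
        enhanced_caption="A man walks left to right, then sits on a red bench.",
        plain_caption="A man is outside.",
    )
    # Made-up log-probabilities: the policy favors the chosen caption more
    # than the reference model does, so the loss falls below log(2).
    print(pair.chosen)
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2. Training lowers the loss by raising the policy's relative likelihood of the chosen caption over the rejected one.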
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization
