Learning to Rank Caption Chains for Video-Text Alignment
Ansel Blume, Burak Uzkent, Shalini Chaudhuri, Garin Kessler

TL;DR
This paper proposes a ranking-based optimization method for improving video-text alignment in vision-language models, demonstrating its advantages over binary preference methods and emphasizing the need for vision encoder finetuning.
Contribution
The paper introduces a ranking optimization approach for video-text alignment, shows that it is more effective than standard binary DPO, and highlights the importance of finetuning the vision encoder.
Findings
Ranking optimization outperforms binary DPO for long-form content.
Generating challenging caption chains improves model training.
Finetuning the vision encoder is crucial for the method's effectiveness.
Abstract
Direct preference optimization (DPO) is an effective technique to train language models to generate preferred over dispreferred responses. However, this binary "winner-takes-all" approach is suboptimal for vision-language models whose response quality is highly dependent on visual content. In particular, a response may still be faithful to the visual inputs even if it is less preferable than an alternative. The standard Bradley-Terry DPO formulation lacks this nuance, upweighting winning responses without sufficient regard for whether the "losing" response still maintains high visual fidelity. In this work, we investigate ranking optimization as an alternative that more precisely situates responses' faithfulness to visual inputs. We focus on video-text alignment using detailed video captions, proposing a method to generate challenging, totally ordered caption chains at scale through…
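The contrast the abstract draws between binary DPO and ranking optimization can be illustrated with a small sketch. The abstract excerpt here does not state the paper's exact ranking objective, so the listwise loss below uses a Plackett-Luce formulation as an assumption, not as the authors' method. The function names, the `beta` value, and the log-probability inputs are all hypothetical and chosen only for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary Bradley-Terry DPO loss for one (winner, loser) pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)


def plackett_luce_loss(logps, ref_logps, beta=0.1):
    """Listwise ranking loss over captions ordered best to worst.

    ASSUMPTION: Plackett-Luce is used here as one common ranking
    objective; the paper may use a different formulation.
    """
    rewards = [beta * (p - r) for p, r in zip(logps, ref_logps)]
    loss = 0.0
    # At each position k, the k-th caption should beat every caption
    # ranked below it (softmax over the remaining tail).
    for k in range(len(rewards) - 1):
        tail = rewards[k:]
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= rewards[k] - log_norm
    return loss
```

With a chain of length two, this listwise loss reduces to the binary DPO loss. Longer chains add a term for each intermediate caption, so a caption in the middle of the chain is pushed above worse captions and below better ones instead of being treated purely as a loser.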
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
