LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng

TL;DR
This paper introduces LeanPO, a novel preference optimization method for Video-LLMs that addresses likelihood displacement issues, improves alignment with human preferences, and enhances model performance with minimal overhead.
Contribution
LeanPO reformulates reward estimation for Video-LLMs, incorporating self-generated preference data and dynamic label smoothing to improve alignment and mitigate likelihood drop issues.
Findings
Significantly improves Video-LLM performance across various models.
Effectively mitigates likelihood displacement during training.
Enhances alignment with human trustworthiness in Video-LLMs.
Abstract
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO (Rafailov et al., 2024), to optimize the reward margin between a winning response and a losing response. However, the likelihood displacement observed in DPO indicates that the likelihoods of both responses often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose Lean Preference Optimization (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the…
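To make the idea of a reference-free, length-averaged reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reward is the mean per-token log-probability of a response under the policy (a log-domain stand-in for "average likelihood"), and it pairs that reward with a fixed label-smoothed logistic loss. The function names, the scale `beta`, the smoothing weight `eps`, and the exact loss form are all hypothetical; the dynamic schedule for label smoothing is not described in the text above and is not modeled here.

```python
import math


def avg_log_likelihood(token_logprobs):
    """Mean per-token log-probability of a response under the policy.

    Assumption: used here as a stand-in for the 'average likelihood' reward.
    No reference model is involved, so the reward is reference-free.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_preference_loss(win_logprobs, lose_logprobs, beta=1.0, eps=0.1):
    """Label-smoothed logistic loss on the reward margin.

    beta (reward scale) and eps (smoothing weight) are hypothetical
    hyperparameters chosen for illustration only.
    """
    margin = beta * (
        avg_log_likelihood(win_logprobs) - avg_log_likelihood(lose_logprobs)
    )
    # With weight (1 - eps) the pair is treated as correctly labeled,
    # with weight eps as if the preference label were flipped.
    return -(1.0 - eps) * log_sigmoid(margin) - eps * log_sigmoid(-margin)


# Per-token log-probabilities for two responses of different lengths.
winning = [-0.2, -0.4, -0.3]
losing = [-1.0, -1.2]
loss = smoothed_preference_loss(winning, losing)
```

Because each reward is averaged over its own response length, a long response is not penalized merely for having more tokens, and setting `eps=0.0` recovers the plain logistic loss on the margin.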
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Direct Preference Optimization · Label Smoothing · ADaptive gradient method with the OPTimal convergence rate
