ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO
Daechul Ahn, Yura Choi, San Kim, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

TL;DR
This paper introduces ISR-DPO, a novel iterative self-improvement method for large multimodal video models that improves preference alignment and reduces hallucinations, significantly enhancing video question answering performance.
Contribution
The paper proposes ISR-DPO, a new approach using self-retrospection to improve preference modeling and visual grounding in large multimodal video models, addressing modality misalignment issues.
Findings
ISR-DPO outperforms state-of-the-art methods on multiple benchmarks.
Enhanced focus on informative video regions improves preference accuracy.
Open-sourcing code and datasets promotes further research.
Abstract
Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance…
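For readers unfamiliar with the objective named in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' implementation and does not include the self-retrospective step; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration. Inputs are summed sequence log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All names and the default beta are illustrative assumptions,
    not taken from the ISR-DPO paper or its code release.
    """
    # Implicit rewards: how much the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy favors the chosen response more than the reference does,
    # so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In an iterative setup such as the one the abstract describes, the preference pairs fed to a loss like this would be regenerated and re-judged by the model at each round; how ISR-DPO builds those pairs is beyond what this sketch covers.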
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Focus
