ISR-DPO: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective DPO

Daechul Ahn; Yura Choi; San Kim; Youngjae Yu; Dongyeop Kang; Jonghyun Choi

arXiv:2406.11280·cs.CV·January 9, 2025

TL;DR

This paper introduces ISR-DPO, a novel iterative self-improvement method for large multimodal video models that improves preference alignment and reduces hallucinations, significantly enhancing video question answering performance.

Contribution

The paper proposes ISR-DPO, a new approach using self-retrospection to improve preference modeling and visual grounding in large multimodal video models, addressing modality misalignment issues.

Findings

01. ISR-DPO outperforms state-of-the-art methods on multiple benchmarks.

02. Enhanced focus on informative video regions improves preference accuracy.

03. Open-sourcing code and datasets promotes further research.

Abstract

Iterative self-improvement, a concept extending beyond personal growth, has found powerful applications in machine learning, particularly in transforming weak models into strong ones. While recent advances in natural language processing have shown its efficacy through iterative preference optimization, applying this approach to Video Large Multi-modal Models (VLMMs) remains challenging due to modality misalignment. VLMMs struggle with this misalignment during iterative preference modeling, as the self-judge model often prioritizes linguistic knowledge over visual information. Additionally, iterative preference optimization can lead to visually hallucinated verbose responses due to length bias within the self-rewarding cycle. To address these issues, we propose Iterative Self-Retrospective Direct Preference Optimization (ISR-DPO), a method that uses self-retrospection to enhance…

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Focus