Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Ruohong Zhang; Liangke Gui; Zhiqing Sun; Yihao Feng; Keyang Xu; Yuanhan Zhang; Di Fu; Chunyuan Li; Alexander Hauptmann; Yonatan Bisk; Yiming Yang

arXiv:2404.01258·cs.CV·April 3, 2024·1 cites

Open Access 1 Repo 3 Models 2 Datasets 1 Video

TL;DR

This paper presents a framework that uses detailed video captions as a proxy for video content, so that a language model can assess the factuality of video question answering responses and serve as a reward signal for preference optimization of video large multimodal models.

Contribution

It introduces a new method leveraging video captions as evidence, aligning with GPT-4V's reward system, to enhance preference optimization in video multimodal models.

Findings

01. Improved alignment with the GPT-4V reward mechanism.

02. Enhanced performance on video QA tasks.

03. Effective use of video captions as content proxies.

Abstract

Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of large language models (LLMs). However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content, enabling language models to incorporate this information as supporting evidence for scoring video Question Answering (QA) predictions. Our approach demonstrates…
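For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. It is not taken from the paper or its repository: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions chosen for illustration. The inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All arguments are sequence log-probabilities (floats).
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Implicit rewards: how much the policy has moved from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical log-probabilities, for illustration only.
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
loss_prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

When the policy assigns the same log-probabilities as the reference, the loss equals log 2; it falls below that value as the policy raises the likelihood of the chosen response relative to the rejected one. In the setting the abstract describes, the chosen and rejected labels would come from the language model's scores of candidate answers given the video caption as evidence.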

Code & Models

Repositories

riflezhang/llava-hound-dpo
PyTorch · Official

Videos

Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward

Taxonomy

Topics: Educational and Technological Research · Computational and Text Analysis Methods · Multimodal Machine Learning Applications

Methods: Direct Preference Optimization