Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution

Wentao Tan; Qiong Cao; Yibing Zhan; Chao Xue; Changxing Ding

arXiv:2412.15650 · cs.LG · December 23, 2024

TL;DR

This paper introduces a novel self-evolution framework for multimodal large language models that autonomously generates high-quality data from unannotated images, reducing reliance on human or GPT annotations and improving alignment.

Contribution

The proposed multimodal self-evolution method enables models to generate and evaluate data independently, enhancing alignment without external annotations or additional models.

Findings

01 Competitive performance with external-data methods

02 Efficient and scalable data generation process

03 Reduced hallucinations through content alignment

Abstract

Human preference alignment can greatly enhance Multimodal Large Language Models (MLLMs), but collecting high-quality preference data is costly. A promising solution is the self-evolution strategy, where models are iteratively trained on data they generate. However, current techniques still rely on human- or GPT-annotated data and sometimes require additional models or ground truth answers. To address these issues, we propose a novel multimodal self-evolution framework that enables the model to autonomously generate high-quality questions and answers using only unannotated images. First, we implement an image-driven self-questioning mechanism, allowing the model to create and evaluate questions based on image content, regenerating them if they are irrelevant or unanswerable. This sets a strong foundation for answer generation. Second, we introduce an answer self-enhancement technique,…
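The image-driven self-questioning step described in the abstract (generate a question, check it, regenerate if it is irrelevant or unanswerable) can be sketched as a simple retry loop. This is a minimal illustration, not the authors' implementation: the function name `self_question`, the `generate` and `is_acceptable` callbacks, and the `max_attempts` cap are all hypothetical. In the paper's setting both callbacks would be served by the same MLLM; here they are toy stand-ins so the sketch runs on its own.

```python
from typing import Callable, Iterable, Optional


def self_question(
    image: str,
    generate: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    max_attempts: int = 3,  # assumed cap; the abstract does not state a retry limit
) -> Optional[str]:
    """Propose a question for `image`, regenerating while the check rejects it."""
    for _ in range(max_attempts):
        question = generate(image)
        if is_acceptable(image, question):
            return question
    return None  # no acceptable question found; the caller may skip this image


def make_toy_generator(candidates: Iterable[str]) -> Callable[[str], str]:
    """Toy stand-in for the model's question generation: replays fixed candidates."""
    it = iter(candidates)
    return lambda image: next(it)


def toy_check(image: str, question: str) -> bool:
    """Toy stand-in for the model's self-evaluation of relevance/answerability."""
    return question.endswith("?") and "image" in question


if __name__ == "__main__":
    gen = make_toy_generator(["Tell me a joke", "What is shown in the image?"])
    print(self_question("example.jpg", gen, toy_check))
```

Returning `None` after the attempt budget is one possible design choice; it lets a data-generation pipeline drop images for which no usable question is produced rather than loop indefinitely.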

Code & Models

Repositories

wentaotan/sena (PyTorch, official)

Videos

Beyond Human Data: Aligning Multimodal Large Language Models by Iterative Self-Evolution · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques