Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Xun Wu; Shaohan Huang; Furu Wei

arXiv:2404.15100·cs.CV·April 24, 2024

Open Access

TL;DR

This paper introduces VisionPrefer, a high-quality, multi-aspect preference dataset created using multimodal large language models, which improves text-to-image alignment and generalizes better than previous metrics.

Contribution

The paper presents VisionPrefer, a novel AI-annotated preference dataset capturing multiple aspects, and demonstrates its effectiveness in enhancing text-to-image generative model alignment.

Findings

01

VP-Score achieves human-level preference prediction accuracy.

02

VisionPrefer improves compositional image generation across multiple aspects.

03

Synthetic AI-generated data enhances model alignment with human preferences.

Abstract

Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning in open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques