Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
Xun Wu, Shaohan Huang, Furu Wei

TL;DR
This paper introduces VisionPrefer, a high-quality, multi-aspect preference dataset created using multimodal large language models, which improves text-to-image alignment and generalizes better than previous metrics.
Contribution
The paper presents VisionPrefer, a novel AI-annotated preference dataset capturing multiple aspects, and demonstrates its effectiveness in enhancing text-to-image generative model alignment.
Findings
VP-Score achieves human-level preference prediction accuracy.
VisionPrefer improves compositional image generation across multiple aspects.
Synthetic AI-generated data enhances model alignment with human preferences.
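As a rough illustration of what "preference prediction accuracy" usually means for a scorer such as VP-Score, the sketch below computes the fraction of image pairs on which a scorer ranks the two images the same way as a human label. This is an assumed, generic formulation, not the paper's evaluation code: the function name `pairwise_preference_accuracy`, its arguments, and the toy numbers are all hypothetical.

```python
from typing import Sequence


def pairwise_preference_accuracy(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefers_a: Sequence[bool],
) -> float:
    """Fraction of pairs where the scorer's ranking matches the human label.

    Hypothetical helper for illustration only. A tie (equal scores) is
    counted as the scorer preferring image B.
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if len(scores_a) == 0:
        raise ValueError("need at least one pair")
    agree = sum(
        (sa > sb) == pref
        for sa, sb, pref in zip(scores_a, scores_b, human_prefers_a)
    )
    return agree / len(scores_a)


# Toy, made-up numbers purely for illustration (not results from the paper).
acc = pairwise_preference_accuracy(
    scores_a=[0.9, 0.2, 0.6, 0.4],
    scores_b=[0.1, 0.8, 0.7, 0.3],
    human_prefers_a=[True, False, True, True],
)
print(acc)  # 3 of 4 pairs agree -> 0.75
```

Under this formulation, a scorer is compared against human annotators by running the same computation with each annotator's choices in place of the scorer's ranking.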
Abstract
Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning in open-source text-to-image generative models and hinders further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
