Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Xiaoliang Dai; Ji Hou; Chih-Yao Ma; Sam Tsai; Jialiang Wang; Rui Wang; Peizhao Zhang; Simon Vandenhende; Xiaofang Wang; Abhimanyu Dubey; Matthew Yu; Abhishek Kadian; Filip Radenovic; Dhruv Mahajan; Kunpeng Li; Yue Zhao; Vladan Petrovic; Mitesh Kumar Singh; Simran Motwani; Yi Wen; Yiwen Song; Roshan Sumbaly; Vignesh Ramanathan; Zijian He; Peter Vajda; Devi Parikh

arXiv:2309.15807·cs.CV·September 28, 2023·30 cites

TL;DR

This paper introduces Emu, a text-to-image model fine-tuned on a small set of high-quality images to substantially improve the aesthetic quality of its generations. Emu outperforms both its pre-trained version and a state-of-the-art model on visual appeal.

Contribution

The paper presents quality-tuning, a fine-tuning method that improves the aesthetic quality of a pre-trained model's output using only a few thousand high-quality images and that applies across different architectures.

Findings

01. Emu achieves an 82.9% win rate over its pre-trained version.

02. Emu is preferred 68.4% and 71.3% of the time on two visual-appeal benchmarks.

03. Quality-tuning improves aesthetic quality across various model architectures.
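
The win rates and preference percentages above come from pairwise comparisons between two models. As a minimal sketch of how such a figure could be computed, the snippet below scores a list of per-prompt preference labels. The function name `win_rate`, the label values `"A"`, `"B"`, and `"tie"`, the `ignore_ties` option, and the toy votes are all assumptions made for illustration; this page does not state how the paper aggregates annotator votes or handles ties.

```python
from collections import Counter


def win_rate(votes, ignore_ties=True):
    """Return the fraction of pairwise comparisons won by model "A".

    votes: iterable of "A", "B", or "tie", one label per compared prompt.
    ignore_ties: if True, ties are dropped from the denominator
                 (an assumption; the source does not specify tie handling).
    """
    counts = Counter(votes)
    wins, losses, ties = counts["A"], counts["B"], counts["tie"]
    total = wins + losses if ignore_ties else wins + losses + ties
    if total == 0:
        raise ValueError("no comparisons to score")
    return wins / total


# Toy, made-up labels: 8 wins for A, 2 for B, 2 ties.
votes = ["A"] * 8 + ["B"] * 2 + ["tie"] * 2
print(f"{win_rate(votes):.1%}")                     # 80.0%
print(f"{win_rate(votes, ignore_ties=False):.1%}")  # 66.7%
```

The two printed values show how much the choice of tie handling can move a reported win rate, which is worth checking in the full paper before comparing numbers across studies.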

Abstract

Training text-to-image models with web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of $82.9\%$ compared with its…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Latent Diffusion Model · Diffusion