Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Yufan Zhou; Ruiyi Zhang; Kaizhi Zheng; Nanxuan Zhao; Jiuxiang Gu; Zichao Wang; Xin Eric Wang; Tong Sun

arXiv:2406.09305·cs.CV·August 9, 2024

Open Access

TL;DR

This paper introduces Toffee, a cost-effective method for constructing large-scale subject-driven text-to-image datasets without subject-level fine-tuning, enabling high-quality image generation and editing with significantly reduced GPU hours.

Contribution

We propose Toffee, a novel dataset construction approach that generates millions of high-quality image pairs without fine-tuning, vastly reducing costs and enabling large-scale subject-driven image tasks.

Findings

01

Constructed a 5 million image pair dataset, five times larger than previous datasets.

02

Achieved comparable results in subject-driven image editing and generation using our dataset.

03

Reduced dataset construction cost by tens of thousands of GPU hours.

Abstract

In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for a specific subject from an arbitrary test image in a zero-shot manner. They even outperform methods that require additional fine-tuning on test images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images of the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose…
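The cost argument in the abstract is a linear scaling: if every subject needs its own fine-tuning run, total GPU hours grow in proportion to the number of subjects. The sketch below illustrates that arithmetic only. The function name and the per-subject cost are hypothetical placeholders, not figures from the paper.

```python
def finetune_pipeline_gpu_hours(num_subjects: int, gpu_hours_per_subject: float) -> float:
    """Total GPU hours when each subject requires its own fine-tuning run."""
    return num_subjects * gpu_hours_per_subject


# ASSUMPTION: illustrative per-subject cost (about 6 GPU-minutes), not from the paper.
ASSUMED_GPU_HOURS_PER_SUBJECT = 0.1

for n in (10_000, 1_000_000, 5_000_000):
    total = finetune_pipeline_gpu_hours(n, ASSUMED_GPU_HOURS_PER_SUBJECT)
    print(f"{n:>9,} subjects -> {total:>9,.0f} GPU hours")
```

Under this assumed per-subject cost, a dataset with millions of subjects lands in the range of hundreds of thousands of GPU hours, which is the order of magnitude the abstract describes for fine-tuning-based pipelines.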

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques