Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics
Anton Voronov, Mikhail Khoroshikh, Artem Babenko, Max Ryabinin

TL;DR
This paper introduces a simple early stopping criterion based on the training objective to accelerate text-to-image model personalization, achieving up to 8x faster adaptation without quality loss.
Contribution
The authors propose a novel, easy-to-implement early stopping method that tracks objective dynamics to speed up personalization of large text-to-image models.
Findings
The early stopping criterion gives up to 8 times faster adaptation with no quality loss.
Most concepts are learned early, and standard metrics fail to indicate convergence.
The method is effective across multiple concepts and personalization techniques.
Abstract
Text-to-image generation models represent the next step of evolution in image synthesis, offering a natural way to achieve flexible yet fine-grained control over the result. One emerging area of research is the fast adaptation of large text-to-image models to smaller datasets or new visual concepts. However, many efficient methods of adaptation have a long training time, which limits their practical applications, slows down experiments, and spends excessive GPU resources. In this work, we study the training dynamics of popular text-to-image personalization methods (such as Textual Inversion or DreamBooth), aiming to speed them up. We observe that most concepts are learned at early stages and do not improve in quality later, but standard training convergence metrics fail to indicate that. Instead, we propose a simple drop-in early stopping criterion that only requires computing the…
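The idea of stopping once the objective stops changing can be sketched in a few lines. The abstract above is cut off before it states what the criterion computes, so the rule below is a generic illustration and not the authors' method: it compares the variance of the most recent objective values with the variance over the whole run. The names `ObjectiveTracker` and `steps_until_stop`, the window size, and the threshold are all hypothetical choices made for this example.

```python
from collections import deque
from statistics import pvariance


class ObjectiveTracker:
    """Hypothetical helper: signals a stop once the tracked objective flattens.

    Assumption (not from the paper): training is considered converged when the
    variance of the last `window` values is a small fraction of the variance of
    every value seen so far.
    """

    def __init__(self, window=5, rel_threshold=0.1):
        self.window = window
        self.rel_threshold = rel_threshold
        self.recent = deque(maxlen=window)  # last `window` objective values
        self.history = []                   # all objective values so far

    def update(self, value):
        """Record one objective value; return True if training should stop."""
        self.history.append(value)
        self.recent.append(value)
        # Wait for at least two windows of data before judging the trend.
        if len(self.history) < 2 * self.window:
            return False
        total_var = pvariance(self.history)
        if total_var == 0.0:
            return True  # the objective never moved at all
        return pvariance(self.recent) / total_var < self.rel_threshold


def steps_until_stop(values, window=5, rel_threshold=0.1):
    """Return the 1-based step at which the tracker fires, or None."""
    tracker = ObjectiveTracker(window, rel_threshold)
    for step, value in enumerate(values, start=1):
        if tracker.update(value):
            return step
    return None


if __name__ == "__main__":
    # Synthetic objective: fast improvement, then a plateau.
    losses = [1.0, 0.8, 0.6, 0.4, 0.2] + [0.1] * 20
    print(steps_until_stop(losses))
```

In a real personalization run, `update` would be called with an objective value computed at regular intervals during training. How that value is measured matters a great deal for whether such a rule is reliable, and this sketch makes no claim about that.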
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Early Stopping
