HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou

TL;DR
HiFi Tuner introduces a novel, efficient fine-tuning method for diffusion models that significantly improves personalized image generation fidelity and enables subject substitution through text, outperforming previous approaches.
Contribution
The paper presents HiFi Tuner, a parameter-efficient fine-tuning framework with novel techniques like mask guidance and reference-guided generation to enhance subject fidelity in personalized diffusion-based image synthesis.
Findings
Improves CLIP-T score by 3.6 points over Textual Inversion
Enhances DINO score by 9.6 points over Textual Inversion
Sets new state-of-the-art results on the DreamBooth dataset
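The CLIP-T and DINO scores above are, in the DreamBooth evaluation protocol, averages of cosine similarities between embeddings: prompt versus generated image for CLIP-T, generated versus real subject images for DINO. The following is a minimal sketch of that averaging step only, not the authors' evaluation code. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real CLIP or DINO embeddings, and reading "points" as the score multiplied by 100 is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(generated, references):
    """Average cosine similarity over every (generated, reference) pair."""
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    return sum(scores) / len(scores)

# Toy embeddings (hypothetical); real ones would come from a DINO or CLIP encoder.
real_embeddings = [[1.0, 0.0], [0.0, 1.0]]
generated_embeddings = [[1.0, 0.0]]

score = mean_pairwise_similarity(generated_embeddings, real_embeddings)
# Assumption: "points" in the findings means the score scaled by 100.
print(f"similarity score: {score:.3f} ({score * 100:.1f} points)")
```

For a DINO-style score both lists would hold image embeddings; for a CLIP-T-style score one list would hold text embeddings of the prompts and the other the embeddings of the generated images.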
Abstract
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Image Retrieval and Classification Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer · Diffusion · self-DIstillation with NO labels
