HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

Zhonghao Wang; Wei Wei; Yang Zhao; Zhisheng Xiao; Mark Hasegawa-Johnson; Humphrey Shi; Tingbo Hou

arXiv:2312.00079 · cs.CV · December 4, 2023 · 1 citation



Open Access

TL;DR

HiFi Tuner introduces a parameter-efficient fine-tuning method for diffusion models that improves subject fidelity in personalized image generation and enables subject substitution through text, outperforming prior approaches such as Textual Inversion on the DreamBooth dataset.

Contribution

The paper presents HiFi Tuner, a parameter-efficient fine-tuning framework with novel techniques like mask guidance and reference-guided generation to enhance subject fidelity in personalized diffusion-based image synthesis.
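This page does not give the mask-guidance formulation. As a minimal sketch, assuming mask guidance restricts the standard noise-prediction (denoising) loss to the subject region given by a binary mask, the loss could look like the following. `masked_denoising_loss` is a hypothetical helper, not the authors' code, and tensors are flattened to plain lists to keep the example dependency-free.

```python
from typing import Sequence


def masked_denoising_loss(
    eps_true: Sequence[float],
    eps_pred: Sequence[float],
    mask: Sequence[float],
) -> float:
    """Hypothetical mask-guided denoising loss (not the paper's exact form).

    Squared error between the true noise and the model's prediction,
    counted only where mask == 1 (subject positions) and averaged over
    the masked area, so background content does not drive training.
    All inputs are flattened sequences of equal length.
    """
    if not (len(eps_true) == len(eps_pred) == len(mask)):
        raise ValueError("inputs must have the same length")
    area = sum(mask)
    if area == 0:
        raise ValueError("mask selects no elements")
    sq_err = [m * (t - p) ** 2 for t, p, m in zip(eps_true, eps_pred, mask)]
    return sum(sq_err) / area


# Toy example: the background positions (mask 0) have large errors
# that are ignored; only the first two positions contribute.
loss = masked_denoising_loss([0.5, -1.0, 2.0, 0.0], [0.5, 0.0, 0.0, 3.0], [1, 1, 0, 0])
print(loss)  # 0.5
```

Dividing by the mask area rather than by the total element count keeps the loss scale comparable across subjects of different sizes; that normalization is a design choice of this sketch, not something stated on this page.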

Findings

01 Improves the CLIP-T score by 3.6 points over Textual Inversion
02 Improves the DINO score by 9.6 points over Textual Inversion
03 Sets new state-of-the-art results on the DreamBooth dataset
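CLIP-T and DINO are the metrics used with the DreamBooth dataset: CLIP-T is the average cosine similarity between the CLIP embedding of the prompt and the CLIP image embeddings of the generated images (prompt fidelity), and DINO is the average pairwise cosine similarity between DINO ViT embeddings of generated and real subject images (subject fidelity). The sketch below computes both from precomputed embeddings. It assumes the embeddings come from the respective encoders and that "points" means the score scaled by 100, a common reporting convention that this page does not state explicitly.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_t_score(prompt_emb: Vector, generated_embs: Sequence[Vector]) -> float:
    """Mean cosine similarity between one prompt's CLIP text embedding and
    the CLIP image embeddings of the images generated from that prompt."""
    return sum(cosine(prompt_emb, g) for g in generated_embs) / len(generated_embs)


def dino_score(generated_embs: Sequence[Vector], real_embs: Sequence[Vector]) -> float:
    """Mean pairwise cosine similarity between DINO embeddings of
    generated images and real images of the same subject."""
    sims = [cosine(g, r) for g in generated_embs for r in real_embs]
    return sum(sims) / len(sims)


# Toy 2-D vectors stand in for real encoder outputs.
generated = [[1.0, 0.0], [0.0, 1.0]]
real = [[1.0, 0.0]]
print(dino_score(generated, real))        # 0.5
print(100 * dino_score(generated, real))  # 50.0 on the assumed "points" scale
```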

Abstract

This paper explores advances in high-fidelity personalized image generation using pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes from text descriptions and a few input images, maintaining subject fidelity in the generated images remains challenging. In this work, we introduce an algorithm named HiFi Tuner to improve the appearance preservation of objects during personalized image generation. Our method employs a parameter-efficient fine-tuning framework comprising a denoising process and a pivotal inversion process. Key enhancements include mask guidance, a novel parameter regularization technique, and step-wise subject representations, all aimed at improving sample fidelity. Additionally, we propose a…
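The abstract names a pivotal inversion process without detailing it. In related work such as Null-text Inversion, the pivot is a DDIM inversion trajectory: a real image's latent is mapped step by step toward noise using the model's own noise predictions. Assuming HiFi Tuner's pivotal inversion builds on the same deterministic DDIM inversion (an assumption, not confirmed on this page), one step can be sketched as follows. `ddim_inversion_step` is a hypothetical helper, `alpha_bar_t` and `alpha_bar_next` are cumulative products of the noise schedule, and `eps` would come from the diffusion model's noise prediction at `x_t`.

```python
import math
from typing import List, Sequence


def ddim_inversion_step(
    x_t: Sequence[float],
    eps: Sequence[float],
    alpha_bar_t: float,
    alpha_bar_next: float,
) -> List[float]:
    """One deterministic DDIM inversion step from timestep t to the next,
    noisier timestep (hypothetical helper for illustration).

    eps is the model's noise prediction at x_t; alpha_bar_* are cumulative
    noise-schedule products with alpha_bar_next <= alpha_bar_t.
    """
    sqrt_a_t, sqrt_1m_a_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    sqrt_a_n, sqrt_1m_a_n = math.sqrt(alpha_bar_next), math.sqrt(1.0 - alpha_bar_next)
    out = []
    for x, e in zip(x_t, eps):
        # Clean sample implied by the current latent and the noise estimate.
        x0_pred = (x - sqrt_1m_a_t * e) / sqrt_a_t
        # Re-noise that estimate to the next timestep with the same noise.
        out.append(sqrt_a_n * x0_pred + sqrt_1m_a_n * e)
    return out


# With alpha_bar_t = 1.0 the latent is treated as clean; stepping to
# alpha_bar_next = 0.25 scales it by 0.5 and adds sqrt(0.75) * eps.
print(ddim_inversion_step([2.0], [0.0], 1.0, 0.25))  # [1.0]
```

Applying this step repeatedly from the clean latent up to the final timestep, and storing each intermediate latent, yields the kind of trajectory that a Null-text-style pivotal tuning stage then tries to reproduce during denoising.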


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Image Retrieval and Classification Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer · Diffusion · DINO (self-distillation with no labels)