TL;DR
DivControl introduces a decomposable framework for controllable image generation that enables zero-shot adaptation, improves fidelity, and reduces training costs by factorizing and disentangling model components.
Contribution
It proposes a novel SVD-based factorization and knowledge diversion method for unified, efficient, and scalable controllable image generation.
Findings
Achieves state-of-the-art controllability at 36.4× lower training cost.
Demonstrates superior zero-shot and few-shot performance on unseen conditions.
Improves condition fidelity and training efficiency through representation alignment.
Abstract
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of…
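The factorization and soft routing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`svd_components`, `gated_weight`), the choice to treat the first `n_shared` singular components as "learngenes" and the rest as "tailors", and the hand-picked gate logits are all assumptions made for illustration. How the paper actually assigns components and computes the gate is not specified in the excerpt above.

```python
import numpy as np

def svd_components(weight):
    """Factor a weight matrix into rank-1 components (u_i, s_i, v_i)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_weight(u, s, vt, n_shared, gate_logits):
    """Recombine components into one matrix.

    Assumption: the first n_shared components play the role of shared
    'learngenes' (always fully active); the remaining components play the
    role of 'tailors' and are softly weighted by a softmax gate.
    """
    n_tailor = s.shape[0] - n_shared
    assert gate_logits.shape == (n_tailor,)
    gate = softmax(gate_logits)
    scale = np.concatenate([np.ones(n_shared), gate])
    # (u * vector) scales column i of u by entry i of the vector.
    return (u * (s * scale)) @ vt

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))           # stand-in for one layer's weights
u, s, vt = svd_components(W)
full = (u * s) @ vt                       # all components at full strength
# Placeholder logits; a real gate would derive them from the condition.
routed = gated_weight(u, s, vt, n_shared=4, gate_logits=np.array([2.0, -1.0]))
```

Summing all components at full strength reproduces `W`, while the routed matrix keeps the shared components unchanged and rescales only the tailor components by the gate weights.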