ViscoNet: Bridging and Harmonizing Visual and Textual Conditioning for ControlNet
Soon Yau Cheong, Armin Mustafa, Andrew Gilbert

TL;DR
ViscoNet is a lightweight, novel architecture that effectively combines spatial and visual conditioning in text-to-image models, addressing mode collapse and enhancing versatility in human image generation tasks.
Contribution
Introduces ViscoNet, a one-branch-adapter architecture that preserves the generative power of the frozen text-to-image backbone while requiring fewer trainable parameters and a smaller training dataset than IP-Adapter, and that addresses mode collapse.
Findings
Outperforms existing methods in visual-text harmony
Reduces training parameters and dataset requirements
Excels in diverse human image generation tasks
Abstract
This paper introduces ViscoNet, a novel one-branch-adapter architecture for concurrent spatial and visual conditioning. Our lightweight model requires trainable parameters and dataset size multiple orders of magnitude smaller than the current state-of-the-art IP-Adapter. Nevertheless, our method successfully preserves the generative power of the frozen text-to-image (T2I) backbone. Notably, it excels in addressing mode collapse, a pervasive issue previously overlooked. Our novel architecture demonstrates outstanding capabilities in achieving a harmonious visual-text balance, unlocking unparalleled versatility in various human image generation tasks, including pose re-targeting, virtual try-on, stylization, person re-identification, and textile transfer. Demo and code are available from the project page: https://soon-yau.github.io/visconet/
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Diffusion · Latent Diffusion Model
