DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation
Guandong Li, Yijun Ding

TL;DR
DVI is a zero-shot framework that disentangles semantic and visual identity to improve personalized image generation, enhancing visual consistency and atmospheric fidelity without training.
Contribution
DVI introduces a novel disentanglement of identity into semantic and visual streams using VAE statistics and a parameter-free modulation, enabling training-free personalization.
Findings
Significantly improves visual consistency and atmospheric fidelity.
Outperforms state-of-the-art methods in identity preservation.
Operates without parameter fine-tuning.
Abstract
Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to "Semantic-Visual Dissonance," where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural "sticker-like" effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the…
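The abstract is cut off before it gives the exact modulation rule, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes an AdaIN-style rule: the global mean and standard deviation of a reference VAE latent are used to rescale and shift each standardized semantic token, with no learned weights. The function names `visual_stats` and `modulate`, the nested-list shapes, and the toy values are all hypothetical.

```python
from statistics import fmean, pstdev


def visual_stats(latent):
    """Global mean and standard deviation of a VAE latent.

    `latent` is a nested list shaped [channels][values]; a real latent would
    be a tensor, flattened here to keep the sketch dependency-free.
    """
    values = [v for channel in latent for v in channel]
    return fmean(values), pstdev(values)


def modulate(tokens, mu, sigma, eps=1e-6):
    """AdaIN-style modulation with no learned parameters (an assumed form).

    Each semantic token is standardized, then rescaled by `sigma` and shifted
    by `mu`, so it carries the reference latent's global statistics.
    """
    out = []
    for token in tokens:
        t_mu, t_sigma = fmean(token), pstdev(token)
        out.append([sigma * (x - t_mu) / (t_sigma + eps) + mu for x in token])
    return out


# Toy reference latent: 2 channels x 4 values (hypothetical numbers).
latent = [[0.0, 2.0, 4.0, 6.0], [1.0, 3.0, 5.0, 7.0]]
mu, sigma = visual_stats(latent)

# Toy semantic embedding: 2 tokens of dimension 4 (hypothetical numbers).
tokens = [[0.1, -0.3, 0.5, 0.9], [2.0, 2.5, 1.5, 3.0]]
modulated = modulate(tokens, mu, sigma)
```

After modulation, every token has approximately the latent's mean and standard deviation while keeping the relative ordering of its own components. That is one plausible reading of injecting coarse visual statistics into fine-grained semantic embeddings without adding trainable parameters.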
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · 3D Shape Modeling and Analysis
