XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu

TL;DR
XVerse introduces a novel method for multi-subject, fine-grained control in text-to-image diffusion models, enabling independent manipulation of subjects and attributes with high fidelity and coherence.
Contribution
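The paper proposes a multi-subject control technique based on token-specific text-stream modulation: reference images are converted into offsets applied to the modulation of individual text tokens, which improves editability and attribute disentanglement in Diffusion Transformer (DiT) image synthesis.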
Findings
Enables precise, independent control over multiple subjects.
Maintains high image fidelity and coherence.
Improves attribute disentanglement and editability.
Abstract
Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows precise and independent control of each specific subject without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.
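The abstract does not spell out the modulation formula, so the following is only a minimal NumPy sketch of the general idea, assuming the adaLN-style shift/scale modulation commonly used in DiT blocks. The function names, the `offsets` dictionary, and the tensor shapes are hypothetical illustrations and are not taken from the XVerse code. Text tokens tied to a subject receive an extra per-token offset on top of the shared modulation, while all other tokens are modulated as usual.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector over its feature dimension (no learned affine).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulate_text_tokens(text_tokens, shift, scale, offsets=None):
    """Apply shared shift/scale modulation plus optional per-token offsets.

    text_tokens: (T, D) array of text-stream token features.
    shift, scale: (D,) shared modulation vectors (assumed adaLN-style).
    offsets: hypothetical dict mapping a token index to a pair
             (delta_shift, delta_scale), each of shape (D,), standing in
             for offsets derived from a subject's reference image.
    """
    num_tokens = text_tokens.shape[0]
    # Start from the shared modulation, copied once per token.
    per_token_shift = np.tile(shift.astype(float), (num_tokens, 1))
    per_token_scale = np.tile(scale.astype(float), (num_tokens, 1))
    # Add subject-specific offsets only to the selected tokens.
    for idx, (delta_shift, delta_scale) in (offsets or {}).items():
        per_token_shift[idx] += delta_shift
        per_token_scale[idx] += delta_scale
    return layer_norm(text_tokens) * (1.0 + per_token_scale) + per_token_shift
```

In this sketch only the rows listed in `offsets` change relative to the shared modulation, which mirrors the abstract's claim that control is applied per subject token without touching image latents or features.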
Taxonomy
Topics: Cognitive Computing and Networks · Robotics and Automated Systems · Big Data and Digital Economy
Methods: Diffusion
