UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
Yiyan Xu, Qiulin Wang, Wenjie Wang, Yunyao Mao, Xintao Wang, Pengfei Wan, Kun Gai, Fuli Feng

TL;DR
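UniCustom is a unified visual conditioning framework that fuses semantic (ViT) and appearance (VAE) features early, instead of injecting appearance features into the diffusion backbone at a later stage, so that each referenced subject stays tied to the visual details of its own reference image in multi-reference generation.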
Contribution
UniCustom combines early fusion of visual features with a two-stage training strategy and slot-wise regularization to improve subject consistency and reduce cross-reference confusion.
Findings
Improves subject identity preservation in multi-reference generation.
Enhances instruction following and compositional fidelity.
Outperforms strong baselines on benchmark datasets.
Abstract
Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual…
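The difference between decoupled and unified conditioning can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated before the method is described, so everything here is an assumption made for illustration. It assumes that each reference image yields the same number of ViT and VAE tokens, that fusion is a token-wise sum of two linear projections, and that each fused token carries the index ("slot") of its reference image. The names `project`, `fuse_reference`, `build_unified_condition`, and `rand_matrix` are hypothetical.

```python
import random


def project(tokens, weight):
    """Multiply an (n, d_in) list of tokens by a (d_in, d_out) weight matrix."""
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [[sum(t * w for t, w in zip(tok, col)) for col in columns]
            for tok in tokens]


def fuse_reference(vit_tokens, vae_tokens, w_sem, w_app):
    """Hypothetical early fusion for one reference image.

    Semantic (ViT) and appearance (VAE) tokens are projected to a shared
    width and added token-wise, so every token carries both kinds of
    information before any downstream model sees it.
    """
    if len(vit_tokens) != len(vae_tokens):
        raise ValueError("assumed: one VAE token per ViT token")
    sem = project(vit_tokens, w_sem)
    app = project(vae_tokens, w_app)
    return [[s + a for s, a in zip(s_tok, a_tok)]
            for s_tok, a_tok in zip(sem, app)]


def build_unified_condition(references, w_sem, w_app):
    """Concatenate fused tokens of all references and record their slot index."""
    cond, slot_ids = [], []
    for slot, (vit_tokens, vae_tokens) in enumerate(references):
        fused = fuse_reference(vit_tokens, vae_tokens, w_sem, w_app)
        cond.extend(fused)
        slot_ids.extend([slot] * len(fused))
    return cond, slot_ids


def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix as nested lists (stand-in for real features)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_vit, d_vae, d_model, n_tokens = 8, 4, 16, 5  # toy sizes, chosen arbitrarily
w_sem = rand_matrix(rng, d_vit, d_model)
w_app = rand_matrix(rng, d_vae, d_model)

# Two reference images, each with random stand-in ViT and VAE features.
references = [
    (rand_matrix(rng, n_tokens, d_vit), rand_matrix(rng, n_tokens, d_vae))
    for _ in range(2)
]

cond, slot_ids = build_unified_condition(references, w_sem, w_app)
print(len(cond), len(cond[0]), slot_ids)
```

In a decoupled design, the ViT tokens and the VAE tokens would travel as two separate streams and meet only inside the diffusion backbone. In this sketch each fused token already combines both and is tagged with its reference slot, which is the kind of per-reference association the abstract says decoupled conditioning struggles to maintain.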
