UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation

Yiyan Xu; Qiulin Wang; Wenjie Wang; Yunyao Mao; Xintao Wang; Pengfei Wan; Kun Gai; Fuli Feng

arXiv:2605.12088 · cs.CV · May 14, 2026

TL;DR

UniCustom introduces a unified visual conditioning framework that fuses semantic and appearance features early, rather than injecting appearance features later into the diffusion backbone, to improve subject identity preservation and instruction following in multi-reference image generation.

Contribution

The paper proposes an early-fusion approach, combined with a two-stage training strategy and slot-wise regularization, to improve subject consistency and reduce cross-reference confusion.

Findings

1. Improves subject identity preservation in multi-reference generation.
2. Enhances instruction following and compositional fidelity.
3. Outperforms strong baselines on benchmark datasets.

Abstract

Multi-reference image generation aims to synthesize images from textual instructions while faithfully preserving subject identities from multiple reference images. Existing VLM-enhanced diffusion models commonly rely on decoupled visual conditioning: semantic ViT features are processed by the VLM for instruction understanding, whereas appearance-rich VAE features are injected later into the diffusion backbone. Despite its intuitive design, this separation makes it difficult for the model to associate each semantically grounded subject with visual details from the correct reference image. As a result, the model may recognize which subject is being referred to, but fail to preserve its identity and fine-grained appearance, leading to attribute leakage and cross-reference confusion in complex multi-reference settings. To address this issue, we propose UniCustom, a unified visual…
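The contrast the abstract draws between decoupled and unified conditioning can be illustrated with a toy sketch. This is not the paper's implementation: the abstract is truncated before the method is described, so the `Reference` type, the function names, the per-position concatenation, and the assumption that semantic and appearance tokens are aligned one-to-one are all hypothetical. Plain lists of floats stand in for ViT and VAE features.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Reference:
    """One reference image (hypothetical container, not from the paper)."""
    slot: int                  # index of the reference image
    semantic: List[Vector]     # stand-in for semantic ViT tokens
    appearance: List[Vector]   # stand-in for appearance-rich VAE tokens


def decoupled_conditioning(refs: List[Reference]) -> Tuple[List[Vector], List[Vector]]:
    """Decoupled design: two separate streams.

    Semantic tokens would go to the VLM and appearance tokens to the
    diffusion backbone, so neither stream records which appearance
    token belongs to which semantically grounded subject.
    """
    vlm_stream = [tok for r in refs for tok in r.semantic]
    diffusion_stream = [tok for r in refs for tok in r.appearance]
    return vlm_stream, diffusion_stream


def early_fusion(refs: List[Reference]) -> List[Tuple[int, Vector]]:
    """Assumed early fusion: join both feature types per reference.

    Each fused token concatenates a semantic token with the appearance
    token at the same position of the same reference, and keeps the
    slot index so the pairing stays explicit.
    """
    fused = []
    for r in refs:
        if len(r.semantic) != len(r.appearance):
            raise ValueError("this sketch assumes aligned token counts")
        for sem, app in zip(r.semantic, r.appearance):
            fused.append((r.slot, sem + app))  # list concatenation
    return fused


refs = [
    Reference(slot=0, semantic=[[1.0, 2.0]], appearance=[[0.5]]),
    Reference(slot=1, semantic=[[3.0, 4.0]], appearance=[[0.9]]),
]
vlm_stream, diffusion_stream = decoupled_conditioning(refs)
fused_tokens = early_fusion(refs)
```

In the decoupled case the two streams have to be re-associated downstream, which is where the abstract locates attribute leakage and cross-reference confusion. In the fused case every token already carries both feature types for a single reference.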
