MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen; Dongping Chen; Siyuan Wu; Sinan Wang; Shiyun Lang; Petr Sushko; Gaoyang Jiang; Yao Wan; Ranjay Krishna

arXiv:2508.06905·cs.CV·August 27, 2025

TL;DR

This paper introduces MultiRef, a new benchmark and dataset for evaluating controllable image generation using multiple visual references, revealing current models' limitations and guiding future improvements.

Contribution

The paper presents the MultiRef-bench benchmark and the MultiRef dataset, along with an analysis of how state-of-the-art models perform on multi-reference image generation tasks.

Findings

01 State-of-the-art models achieve only 66.6% accuracy on synthetic samples.

02 Models perform better on real-world samples, reaching 79.0%.

03 Current systems struggle with multi-reference conditioning, indicating room for improvement.

Abstract

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs: either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, covering 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset, MultiRef, containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Aesthetic Perception and Analysis