MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

TL;DR
MultiBanana is a comprehensive benchmark designed to evaluate and challenge multi-reference text-to-image generation models across diverse, complex scenarios including multiple references, domain and scale mismatches, rare concepts, and multilingual inputs.
Contribution
The paper introduces MultiBanana, a new benchmark dataset that covers diverse and challenging multi-reference scenarios to better evaluate and compare text-to-image models.
Findings
Models show varied performance across different challenges.
Common failure modes include domain mismatch and rare concept handling.
The benchmark reveals areas for future model improvement.
Abstract
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets often focus on generation using a single or a few reference images, which prevents us from measuring progress in model performance or identifying weaknesses when following instructions with a larger number of references. In addition, their task definitions are still vague, limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the challenges inherent in combining heterogeneous references. To address this gap, we introduce MultiBanana, which is designed to assess the edge of model capabilities by widely covering problems specific to multi-reference settings: (1) varying the number…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
