MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Yuta Oshima; Daiki Miyake; Kohsei Matsutani; Yusuke Iwasawa; Masahiro Suzuki; Yutaka Matsuo; Hiroki Furuta

arXiv:2511.22989·cs.CV·March 27, 2026

TL;DR

MultiBanana is a comprehensive benchmark designed to evaluate and challenge multi-reference text-to-image generation models across diverse, complex scenarios including multiple references, domain and scale mismatches, rare concepts, and multilingual inputs.

Contribution

The paper introduces MultiBanana, a new benchmark dataset that covers diverse and challenging multi-reference scenarios to better evaluate and compare text-to-image models.

Findings

1. Models show varied performance across different challenges.
2. Common failure modes include domain mismatch and rare concept handling.
3. The benchmark reveals areas for future model improvements.

Abstract

Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets often focus on generation using a single or a few reference images, which prevents us from measuring progress in model performance or identifying weaknesses when following instructions with a larger number of references. In addition, their task definitions are still vague, limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the challenges inherent in combining heterogeneous references. To address this gap, we introduce MultiBanana, which is designed to assess the edge of model capabilities by widely covering problems specific to multi-reference settings: (1) varying the number…

Code & Models

Datasets

kohsei/MultiBanana-Benchmark
dataset · 2.8k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship