MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Zhekai Chen; Yuqing Wang; Manyuan Zhang; Xihui Liu

arXiv:2603.25319 · cs.CV · March 27, 2026

Open Access

TL;DR

This paper introduces MacroData, a 400K-sample dataset in which each sample contains up to 10 reference images, and MacroBench, a benchmark for multi-reference image generation. Fine-tuning on MacroData improves generation quality on complex, multi-input tasks.

Contribution

The paper presents MacroData and MacroBench, enabling better training and evaluation of multi-reference image generation models with long-context supervision.

Findings

01. Fine-tuning on MacroData improves generation quality.
02. Cross-task co-training yields synergistic benefits.
03. Effective long-context strategies enhance performance.

Abstract

Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection