MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu

TL;DR
This paper introduces MacroData, a large-scale dataset with structured long-context references, and MacroBench, a benchmark for multi-reference image generation. Fine-tuning on MacroData significantly improves model performance on complex, multi-input tasks.
Contribution
The paper presents MacroData and MacroBench, enabling better training and evaluation of multi-reference image generation models with long-context supervision.
Findings
Fine-tuning on MacroData improves generation quality.
Cross-task co-training yields synergistic benefits.
Effective long-context strategies enhance performance.
Abstract
Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of…
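As a rough illustration of the data layout the abstract describes (up to 10 reference images per sample, grouped under four dimensions), a single record could be modeled as below. This is a minimal sketch, not the released MacroData schema: the class name `MultiReferenceSample`, its fields, and the requirement of at least one reference are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# The abstract states each sample contains "up to 10 reference images".
MAX_REFERENCES = 10


class Dimension(Enum):
    """The four dimensions named in the abstract."""
    CUSTOMIZATION = "customization"
    ILLUSTRATION = "illustration"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"


@dataclass
class MultiReferenceSample:
    """Hypothetical record layout; field names are assumptions, not the paper's."""
    dimension: Dimension
    reference_paths: List[str]  # paths to the conditioning images
    target_path: str            # path to the image to be generated
    prompt: str = ""            # optional text instruction

    def __post_init__(self) -> None:
        # Assumed constraint: at least one reference, at most MAX_REFERENCES.
        n = len(self.reference_paths)
        if not 1 <= n <= MAX_REFERENCES:
            raise ValueError(
                f"expected 1 to {MAX_REFERENCES} references, got {n}"
            )


sample = MultiReferenceSample(
    dimension=Dimension.CUSTOMIZATION,
    reference_paths=["ref_0.png", "ref_1.png", "ref_2.png"],
    target_path="target.png",
    prompt="Place all three subjects in one scene.",
)
```

Validating the reference count at construction time keeps malformed records out of a training loop; a real loader would also need image decoding and whatever metadata the dataset actually ships with.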
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
