A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng; Jianwen Sun; Chuanhao Li; Zizhen Li; Jiaxin Ai; Fanrui Zhang; Yifan Chang; Sizhuo Zhou; Shenglin Zhang; Yu Dai; Kaipeng Zhang

arXiv:2506.09427 · cs.CV · March 3, 2026

TL;DR

This paper introduces InterSyn, a large, high-quality dataset with diverse instructions for interleaved image-text generation, and SynJudge, an automatic evaluation method closely aligned with human judgment, to advance multimodal models.

Contribution

The paper presents InterSyn, a novel large-scale dataset with quality refinement and diverse instructions, along with SynJudge, a reliable automatic evaluation metric for interleaved image-text generation.

Findings

01. Scaling data improves model performance across metrics.

02. Smaller datasets still yield significant improvements.

03. InterSyn enhances training for interactive multimodal models.

Abstract

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily due to the limited scale, quality, and instructional richness of current training datasets. To address this, we introduce InterSyn, a dataset that features: (1) large scale, comprising 1.8M multimodal samples; (2) high quality, supported by our proposed Self-Evaluation with Iterative Refinement (SEIR) method for rigorous automated quality refinement; (3) rich instructional diversity, ensured through diverse well-designed question templates, based on human preferences and covering a 3500-topic hierarchy. These characteristics make InterSyn particularly well-suited for training LMMs in interactive image-text generation capabilities. To evaluate the capabilities, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling