RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Tiancheng Gu; Kaicheng Yang; Chaoyi Zhang; Yin Xie; Xiang An; Ziyong Feng; Dongnan Liu; Weidong Cai; Jiankang Deng

arXiv:2502.12513 · cs.CV · August 6, 2025

TL;DR

RealSyn is a large-scale dataset of realistic and synthetic image-text pairs designed to improve contrastive vision-language models, demonstrating state-of-the-art results across multiple benchmarks.

Contribution

The paper introduces RealSyn, a scalable dataset built with a hierarchical retrieval and semantic augmentation pipeline, which improves contrastive vision-language pre-training.

Findings

1. Models trained on RealSyn outperform models trained on existing datasets on downstream tasks.

2. RealSyn improves zero-shot transfer and robustness.

3. The dataset demonstrates scalable benefits for vision-language pre-training.

Abstract

After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations,…
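The hierarchical retrieval step mentioned in the abstract can be illustrated with a coarse-to-fine search. The sketch below is only one plausible reading, not the authors' implementation: it assumes image and text embeddings already live in a shared, L2-normalized space, groups the texts with a simple spherical k-means, and then searches only the cluster nearest to each image. The function names (`l2_normalize`, `kmeans`, `hierarchical_retrieve`) and all parameters are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def kmeans(x, k, iters=10, seed=0):
    """Tiny spherical k-means over unit-norm rows (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(x @ centroids.T, axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    assign = np.argmax(x @ centroids.T, axis=1)
    return centroids, assign


def hierarchical_retrieve(image_emb, text_emb, centroids, assign, top_k=3):
    """For each image: pick the nearest text cluster (coarse step),
    then rank only the texts inside that cluster (fine step)."""
    results = []
    for img in image_emb:
        cluster = int(np.argmax(centroids @ img))
        candidates = np.flatnonzero(assign == cluster)
        if len(candidates) == 0:
            # Assumption: fall back to a full search if the cluster is empty.
            candidates = np.arange(len(text_emb))
        sims = text_emb[candidates] @ img
        order = np.argsort(-sims)[:top_k]
        results.append(candidates[order])
    return results
```

Restricting the fine search to one cluster trades a small amount of recall for a large reduction in comparisons, which is the usual motivation for a hierarchical scheme when the text pool is very large.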

Code & Models

Repositories

deepglint/realsyn (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Contrastive Language-Image Pre-training