OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan

TL;DR
OmniCorpus is a massive 10 billion-scale image-text interleaved dataset that enhances multimodal model training by providing diverse, high-quality, and flexible data sources, supporting advanced research in multimodal learning.
Contribution
The paper introduces OmniCorpus, the largest and most diverse image-text interleaved dataset to date, enabling improved multimodal large language models.
Findings
The dataset contains 8.6 billion images and 1,696 billion text tokens.
OmniCorpus is 15 times larger than comparable datasets such as MMC4 and OBELICS.
The dataset is versatile, supporting various data formats and sources.
Abstract
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features…
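As a rough illustration of what "image-text interleaved" means in practice, the sketch below models a document as an ordered sequence of text and image segments. All class names, field names, and URLs here are hypothetical and are not taken from the OmniCorpus release; the whitespace word count is only a stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    """A run of text in the document (hypothetical structure)."""
    text: str


@dataclass
class ImageSegment:
    """An image reference in the document (hypothetical structure)."""
    url: str


Segment = Union[TextSegment, ImageSegment]


@dataclass
class InterleavedDocument:
    """Text and images kept in their original reading order."""
    segments: List[Segment]

    def num_images(self) -> int:
        # Count the image segments in the document.
        return sum(isinstance(s, ImageSegment) for s in self.segments)

    def num_words(self) -> int:
        # Whitespace-split word count; an assumption standing in for
        # whatever tokenizer a real pipeline would use.
        return sum(
            len(s.text.split())
            for s in self.segments
            if isinstance(s, TextSegment)
        )


# A toy document with two text runs and two images, in reading order.
doc = InterleavedDocument(
    segments=[
        TextSegment("A cat sits on a mat."),
        ImageSegment("https://example.com/cat.jpg"),
        TextSegment("It then jumps off."),
        ImageSegment("https://example.com/jump.jpg"),
    ]
)

print(doc.num_images(), doc.num_words())
```

Keeping the segments in a single ordered list preserves where each image sits relative to the surrounding text, which is the property that distinguishes interleaved data from isolated image-caption pairs.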
Taxonomy
Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Music and Audio Processing
