OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li; Zhe Chen; Weiyun Wang; Wenhai Wang; Shenglong Ye; Zhenjiang Jin; Guanzhou Chen; Yinan He; Zhangwei Gao; Erfei Cui; Jiashuo Yu; Hao Tian; Jiasheng Zhou; Chao Xu; Bin Wang; Xingjian Wei; Wei Li; Wenjian Zhang; Bo Zhang; Pinlong Cai; Licheng Wen; Xiangchao Yan; Zhenxiang Li; Pei Chu; Yi Wang; Min Dou; Changyao Tian; Xizhou Zhu; Lewei Lu; Yushi Chen; Junjun He; Zhongying Tu; Tong Lu; Yali Wang; Limin Wang; Dahua Lin; Yu Qiao; Botian Shi; Conghui He; Jifeng Dai

arXiv:2406.08418 · cs.CV · July 15, 2024 · 1 citation

TL;DR

OmniCorpus is a massive 10 billion-scale image-text interleaved dataset that enhances multimodal model training by providing diverse, high-quality, and flexible data sources, supporting advanced research in multimodal learning.

Contribution

The paper introduces OmniCorpus, the largest and most diverse image-text interleaved dataset to date, enabling improved multimodal large language models.

Findings

01 The dataset contains 8.6 billion images and 1,696 billion text tokens.

02 OmniCorpus is 15 times larger than comparable datasets such as MMC4 and OBELICS.

03 The dataset is versatile, supporting various data formats and sources.

Abstract

Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features…
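The notion of an image-text interleaved document described in the abstract can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not the OmniCorpus schema: the class names, the `url` field, and the `<image>` placeholder token are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    """A run of text in the document (hypothetical type)."""
    text: str


@dataclass
class ImageSegment:
    """An image reference in the document (hypothetical type and field)."""
    url: str


Segment = Union[TextSegment, ImageSegment]


@dataclass
class InterleavedDocument:
    """Images and text kept in their original reading order."""
    segments: List[Segment]

    def num_images(self) -> int:
        # Count the image segments in the document.
        return sum(isinstance(s, ImageSegment) for s in self.segments)

    def to_prompt(self, image_token: str = "<image>") -> str:
        # Flatten to one string, replacing each image with a placeholder
        # token (the token name is an assumption, not from the paper).
        parts = [
            s.text if isinstance(s, TextSegment) else image_token
            for s in self.segments
        ]
        return " ".join(parts)


doc = InterleavedDocument([
    TextSegment("A cat sits on a sofa."),
    ImageSegment("https://example.com/cat.jpg"),
    TextSegment("Later it moves to the window."),
    ImageSegment("https://example.com/window.jpg"),
])
```

Keeping the segments in a single ordered list preserves the position of each image relative to the surrounding text, which is the property that distinguishes interleaved documents from isolated image-caption pairs.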

Code & Models

Repositories

opengvlab/omnicorpus
PyTorch · Official

Models

🤗 Qingyun/OmniCorpus-InternVL
model · 3 dl · ♡ 6

Taxonomy

Topics: Handwritten Text Recognition Techniques · Video Analysis and Summarization · Music and Audio Processing