MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt

TL;DR
This paper introduces MINT-1T, the largest open-source multimodal interleaved dataset to date, with one trillion text tokens and 3.4 billion images, a 10x scale-up over existing open-source datasets.
Contribution
We present MINT-1T, an open-source dataset of multimodal interleaved sequences that is 10x larger and more diverse than existing open-source datasets. It includes new sources such as PDFs and ArXiv papers, and we share the data curation process.
Findings
Models trained on MINT-1T perform comparably to those trained on OBELICS, the previous leading dataset.
MINT-1T scales open-source multimodal interleaved data by 10x, to one trillion text tokens and 3.4 billion images.
The dataset includes previously untapped sources like PDFs and scientific papers.
Abstract
Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
