OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

TL;DR
OBELICS is a large, openly available web-scale dataset of interleaved image-text documents. Multimodal models trained on it achieve competitive performance on multimodal benchmarks.
Contribution
The paper introduces OBELICS, a comprehensive, filtered dataset of 141 million web pages with interleaved images and text, and demonstrates its effectiveness by training large multimodal models.
Findings
Models trained on OBELICS achieve competitive benchmark performance.
The dataset was used to train the IDEFICS vision and language models at 9 and 80 billion parameters.
Open release of dataset, models, and code facilitates further research.
Abstract
Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documents comprising 141 million web pages extracted from Common Crawl, 353 million associated images, and 115 billion text tokens. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. To show the viability of OBELICS, we train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks. We release our dataset, models and code.
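To make the idea of an interleaved image-text document concrete, here is a minimal Python sketch. The `InterleavedDocument` class, its field names, and the `<image>` placeholder are hypothetical and chosen for illustration; they are not taken from the paper or the released code. The sketch assumes a document can be stored as two parallel lists in which each position holds either a text span or an image reference. It also derives rough per-document averages from the totals quoted in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InterleavedDocument:
    """Hypothetical container for one web page.

    Position i holds either a text span (texts[i]) or an image
    reference (images[i]); the other entry at that position is None.
    """
    texts: List[Optional[str]]
    images: List[Optional[str]]

    def __post_init__(self) -> None:
        if len(self.texts) != len(self.images):
            raise ValueError("texts and images must have the same length")
        for text, image in zip(self.texts, self.images):
            # Exactly one of the two entries must be set at each position.
            if (text is None) == (image is None):
                raise ValueError("each position needs exactly one of text/image")

    def num_images(self) -> int:
        return sum(image is not None for image in self.images)

    def render(self, image_token: str = "<image>") -> str:
        """Flatten the document, replacing each image with a placeholder."""
        parts = [text if text is not None else image_token for text in self.texts]
        return " ".join(parts)


# Totals quoted in the abstract.
NUM_PAGES = 141_000_000
NUM_IMAGES = 353_000_000
NUM_TEXT_TOKENS = 115_000_000_000

# Rough averages derived from those totals (simple ratios, not reported figures).
images_per_page = NUM_IMAGES / NUM_PAGES
tokens_per_page = NUM_TEXT_TOKENS / NUM_PAGES

doc = InterleavedDocument(
    texts=["A cat on a sofa.", None, "A dog in a park."],
    images=[None, "https://example.com/cat.jpg", None],
)
print(doc.render())
print(f"{images_per_page:.1f} images and {tokens_per_page:.0f} text tokens per page on average")
```

Keeping text and images in aligned lists preserves their original order on the page, which is the property that distinguishes interleaved documents from isolated image-text pairs.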
Code & Models
- 🤗 HuggingFaceM4/idefics-80b · model · 331 dl · ♡ 69
- 🤗 HuggingFaceM4/idefics-9b · model · 1.9k dl · ♡ 47
- 🤗 HuggingFaceM4/idefics-9b-instruct · model · 1.2k dl · ♡ 107
- 🤗 HuggingFaceM4/idefics-80b-instruct · model · 5.3k dl · ♡ 189
- 🤗 areegtarek/idefics-9b-instruct-all · model · 12 dl
- 🤗 HuggingFaceM4/idefics2-8b-base · model · 1.6k dl · ♡ 28
- 🤗 HuggingFaceM4/idefics2-8b · model · 157k dl · ♡ 620
- 🤗 HuggingFaceM4/idefics2-8b-chatty · model · 70 dl · ♡ 95
- 🤗 Trelis/idefics2-8b-chatty-bf16 · model · 8 dl · ♡ 1
- 🤗 Reverb/Idefics2-8b-docVQA-finetuned · model · 7 dl · ♡ 3
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
