mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot

TL;DR
mOSCAR is a large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, and multimodal language models trained on it show improved few-shot learning performance.
Contribution
This paper introduces mOSCAR, which is, to the best of the authors' knowledge, the first large-scale multilingual and multimodal document corpus crawled from the web, and shows that training on it improves the performance of multilingual multimodal models.
Findings
Models trained on mOSCAR outperform caption-only models in multilingual tasks.
mOSCAR covers 163 languages with 303 million documents and 200 billion tokens.
Training on mOSCAR improves few-shot learning across diverse benchmarks.
Abstract
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. (2022) showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 303M documents, 200B…
Peer Reviews
Decision · Submitted to ICLR 2025
1. The choice of filters was discussed thoroughly. Relative to other literature in the field, it is clear that an exhaustive search over the space of possible data filtering methodologies was done. 2. The empirical results are quite promising. The set of evaluations/benchmarks is extensive and exhaustive. Moreover, the performance of the models across different numbers of shots demonstrates significant improvements over existing datasets. 3. I found the discussion around diversity measurements and com
1. I found some of the design choices for this framework, while fairly standard, to be insufficiently motivated. Most of the decisions are cited, but several would have benefited from more analysis. 2. Ablations on the filters, including the joint image-text filtering, would have been very interesting to see.
mOSCAR is derived from Common Crawl and utilizes sophisticated methods to filter and refine data. Steps include document deduplication, toxicity filtering, PII removal, and NSFW detection, ensuring a robust dataset suitable for mLLMs. Covering 163 languages, mOSCAR significantly expands the multilingual training dataset landscape, making it inclusive of languages that are often underrepresented in large-scale datasets. The authors conducted extensive benchmark tests, demonstrating that models tr
1. Limited manual validation (1,000 images for NSFW, 100 documents for toxic content) 2. English-centric regex patterns may miss unsafe content in other languages 3. Binary toxicity detection requiring two distinct toxic words could miss subtle harmful content 4. Document-level NSFW filtering is overly aggressive, potentially discarding valuable safe content 5. Reliance on regex and wordlists could be improved with neural-based approaches 6. Fixed character (300) and node count (5) thresholds ma
1. **Novelty and usefulness of contribution** -- There currently does not exist a multilingual multimodal interleaved dataset at this scale, so this artifact is something that would certainly be useful to the community. - This differs from other multilingual multimodal datasets (e.g. LAION-5B, mC4) since it is focused more on interleaved text+images as opposed to simple captioning. - This differs from other interleaved datasets (e.g. IDEFICS) because it is multilingual as opposed to pure
1. **No dataset optimization** -- The paper shows that a Flamingo-style model trained with mOSCAR outperforms a translate-test baseline. This is great, but the paper didn't really mention how much iteration they did when designing this dataset. Were there multiple rounds or iterations of creating mOSCAR? There seems to be a lack of ablations, so to me it felt like the model training experiments were more of an afterthought after the dataset had already been created. (Maybe that was the intention
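The rule-based filters that the reviews above mention (a document counted as toxic only when it contains two distinct toxic words, and fixed thresholds of 300 characters and 5 nodes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Document` class, the function names, the whitespace tokenization, and the reading of both thresholds as minimums over text nodes are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Values quoted in the reviews. Treating them as *minimum* requirements
# on text nodes is an assumption, not something the source confirms.
MIN_CHARS = 300
MIN_NODES = 5
MIN_DISTINCT_TOXIC_WORDS = 2


@dataclass
class Document:
    """Hypothetical document representation: a list of text nodes."""
    text_nodes: list[str]


def count_distinct_toxic_words(doc: Document, wordlist: set[str]) -> int:
    """Count how many different wordlist entries appear in the document.

    Tokenization is a naive whitespace split with punctuation stripped;
    `wordlist` is expected to contain lowercase entries.
    """
    tokens = {
        tok.strip(".,;:!?\"'()").lower()
        for node in doc.text_nodes
        for tok in node.split()
    }
    return len(tokens & wordlist)


def is_toxic(doc: Document, wordlist: set[str]) -> bool:
    """Binary decision: toxic only if two *distinct* toxic words occur."""
    return count_distinct_toxic_words(doc, wordlist) >= MIN_DISTINCT_TOXIC_WORDS


def passes_size_thresholds(doc: Document) -> bool:
    """Require enough total characters and enough nodes (assumed minimums)."""
    total_chars = sum(len(node) for node in doc.text_nodes)
    return total_chars >= MIN_CHARS and len(doc.text_nodes) >= MIN_NODES


def keep_document(doc: Document, wordlist: set[str]) -> bool:
    """Keep a document only if it is large enough and not flagged as toxic."""
    return passes_size_thresholds(doc) and not is_toxic(doc, wordlist)
```

Under this sketch, a document that repeats one wordlist entry many times is still kept, because only distinct entries count. That makes concrete the reviewers' concern that a binary, wordlist-based rule can miss subtle harmful content.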
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
