UberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
DatologyAI: Aldo Gael Carranza, Kaleigh Mentzer, Ricardo Pio Monti, Alex Fang, Alvin Deng, Amro Abbas, Anshuman Suri, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Haakon Mongstad, Haoli Yin, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills

TL;DR
This paper shows that targeted multilingual data curation improves model performance and training efficiency across thirteen languages, yielding high-quality multilingual models with less compute and mitigating the 'curse of multilinguality.'
Contribution
It introduces a data curation approach that enhances multilingual model training, reducing compute costs and mitigating interference, validated on a 20-trillion-token dataset from public sources.
Findings
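Curating data for one language improves performance in others: curating English improves non-English performance in 12 of 13 languages, and curating non-English data improves English in return.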
Targeted curation reduces training compute by 4-10x.
Curated datasets enable effective large-scale multilingual models.
Abstract
Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially…
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Computational and Text Analysis Methods
