DUMB: A Benchmark for Smart Evaluation of Dutch Models
Wietse de Vries, Martijn Wieling, Malvina Nissim

TL;DR
The paper introduces DUMB, a Dutch language model benchmark of nine tasks spanning low-, medium- and high-resource settings, and proposes Relative Error Reduction (RER), which scores models against a strong baseline instead of averaging raw scores across tasks.
Contribution
It presents the Dutch Model Benchmark (DUMB), which covers nine tasks, four of them previously unavailable in Dutch, and the Relative Error Reduction (RER) metric for comparing Dutch language models against a strong, reusable baseline.
Findings
Current Dutch monolingual models underperform.
The results suggest training larger Dutch models with other architectures and pre-training objectives.
DeBERTaV3, XLM-R, and mDeBERTaV3 achieve top results.
Abstract
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks. The total set of nine tasks includes four tasks that were previously not available in Dutch. Instead of relying on a mean score across tasks, we propose Relative Error Reduction (RER), which compares the DUMB performance of language models to a strong baseline which can be referred to in the future even when assessing different sets of language models. Through a comparison of 14 pre-trained language models (mono- and multi-lingual, of varying sizes), we assess the internal consistency of the benchmark tasks, as well as the factors that likely enable high performance. Our results indicate that current Dutch monolingual models under-perform and suggest training larger Dutch models with other architectures and pre-training objectives. At present, the…
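The idea behind RER can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that every task score is a percentage where higher is better, that a task's error is 100 minus its score, and that RER is the share of the baseline's error that a model removes. The function names and the task names in the usage note are hypothetical.

```python
from statistics import mean


def relative_error_reduction(model_score: float, baseline_score: float) -> float:
    """Return the percentage of the baseline's error removed by the model.

    Assumption: scores are percentages (0-100) and error = 100 - score.
    A negative result means the model makes more errors than the baseline.
    """
    baseline_error = 100.0 - baseline_score
    model_error = 100.0 - model_score
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    return 100.0 * (baseline_error - model_error) / baseline_error


def mean_rer(model_scores: dict, baseline_scores: dict) -> float:
    """Average the per-task RER over the tasks listed for the baseline."""
    return mean(
        relative_error_reduction(model_scores[task], baseline_scores[task])
        for task in baseline_scores
    )
```

For example, with a baseline at 80 and a model at 90 on a hypothetical `task_a`, the model halves the baseline's error, so `relative_error_reduction(90.0, 80.0)` gives 50.0. Because each task is normalized by the baseline's own error, a two-point gain on a task where the baseline is already near the ceiling counts for more than the same gain on a task with a lot of headroom, which a plain mean of scores would not capture.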
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: XLM-R
