Language Resources for Dutch Large Language Modelling
Bram Vanroy

TL;DR
This paper introduces Dutch-specific language models based on Llama 2, along with datasets and benchmarks, to address the gap in Dutch NLP resources and facilitate future research and development.
Contribution
It presents fine-tuned Dutch Llama 2 models, new datasets, a performance leaderboard, and a critical discussion on advancing Dutch language modeling.
Findings
Fine-tuned Dutch Llama 2 models outperform baseline models
New datasets improve Dutch language understanding and generation
Benchmark results highlight current strengths and gaps in Dutch NLP
Abstract
Despite the rapid growth in the number and variety of large language models, there remains a notable gap in models specifically designed for the Dutch language. This gap is a shortage not only of pretrained Dutch models but also of data, benchmarks, and leaderboards. This work takes a small step toward improving the situation. First, we introduce two fine-tuned variants of the Llama 2 13B model: we first fine-tuned Llama 2 on Dutch-specific web-crawled data and subsequently refined this model further on multiple synthetic instruction and chat datasets. These datasets, as well as the model weights, are made available. In addition, we provide a leaderboard to keep track of the performance of (Dutch) models on a number of generation tasks, and we include results for a number of state-of-the-art models, including our own. Finally, we provide a critical conclusion on what we believe is…
Code & Models
- 🤗 robinsmits/open_llama_7b_alpaca_clean_dutch_qlora · model · 7 dl · ♡ 1
- 🤗 robinsmits/open_llama_13b_alpaca_clean_dutch_qlora · model · 7 dl
- 🤗 robinsmits/polylm_1.7b_ft_alpaca_clean_dutch · model · 8 dl
- 🤗 robinsmits/polylm_13b_ft_alpaca_clean_dutch · model · 4 dl · ♡ 1
- 🤗 BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny · model · 10 dl · ♡ 4
- 🤗 BramVanroy/Llama-2-13b-chat-dutch · model · 702 dl · ♡ 19
- 🤗 ChocoLlama/ChocoLlama-2-7B-instruct · model · 8 dl · ♡ 2
- 🤗 ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct · model · 14 dl · ♡ 1
- 🤗 ChocoLlama/Llama-3-ChocoLlama-8B-instruct · model · 11 dl · ♡ 6
- 🤗 RichardErkhov/BramVanroy_-_Llama-2-13b-chat-dutch-gguf · model · 69 dl
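For readers who want to try one of the listed models, the following is a minimal sketch of how the chat-tuned 13B model might be loaded. It assumes the `transformers` library and a backend such as PyTorch are installed; the helper names `hub_url` and `load_generator` are hypothetical and introduced only for illustration, and nothing here is taken from the paper's own code.

```python
from typing import Any

# Repository ID copied from the "Code & Models" list (the chat-tuned 13B model).
MODEL_ID = "BramVanroy/Llama-2-13b-chat-dutch"


def hub_url(repo_id: str) -> str:
    """Return the Hugging Face Hub page for a model repository."""
    return f"https://huggingface.co/{repo_id}"


def load_generator(repo_id: str = MODEL_ID) -> Any:
    """Build a text-generation pipeline for the given repository.

    Assumption: `transformers` and a backend such as PyTorch are installed.
    The import is deferred so this module can be imported without them.
    Calling this function downloads the weights, which are large for a
    13B-parameter model.
    """
    from transformers import pipeline

    return pipeline("text-generation", model=repo_id)


# Example usage (not executed here because it triggers a large download):
#   generator = load_generator()
#   result = generator("Wat is de hoofdstad van Nederland?", max_new_tokens=50)
#   print(result[0]["generated_text"])
```

On limited hardware, one of the smaller listed models could be passed as `repo_id` instead; whether a given repository loads through this pipeline without extra steps (for example, adapter or GGUF files) is not confirmed by the listing.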
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Model-Driven Software Engineering Techniques
