ChocoLlama: Lessons Learned From Teaching Llamas Dutch

Matthieu Meeus; Anthony Rathé; François Remy; Pieter Delobelle; Jens-Joris Decorte; Thomas Demeester

arXiv:2412.07633·cs.CL·December 11, 2024

TL;DR

This paper investigates how to adapt English-centric LLMs, specifically Llama-2 and Llama-3, to Dutch: it collects Dutch data, applies continued pretraining with LoRA, and compares the original tokenizer against a new Dutch-specific one.

Contribution

The study compares adaptation techniques for Llama models to Dutch, highlighting the effectiveness of LoRA and tokenizer modifications, and provides a new Dutch benchmark for LLM evaluation.

Findings

01

LoRA effectively scales for language adaptation

02

Tokenizer modification with reinitialization improves performance

03

Llama-3 outperforms adapted Llama-2 in Dutch capabilities

Abstract

While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development. We collect 104GB of Dutch text (32B tokens) from various sources to first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work. For Llama-2, we consider using (i) the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization. We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and a novel Dutch…
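
The abstract names low-rank adaptation (LoRA) as the continued-pretraining method. As a rough illustration of the general technique (not the authors' training code), the sketch below adds a trainable low-rank update to a frozen weight matrix in plain Python. The function names, the dimensions, the rank, and the `alpha` scaling value are all assumptions chosen for illustration; none of them come from the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, alpha, r):
    """Frozen projection x @ w plus a scaled low-rank update (x @ a) @ b.

    Shapes: x (n, d_in), w (d_in, d_out) frozen, a (d_in, r), b (r, d_out).
    Only a and b would receive gradients during training.
    """
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[p + scale * q for p, q in zip(base_row, upd_row)]
            for base_row, upd_row in zip(base, update)]

def trainable_fraction(d_in, d_out, r):
    """Share of parameters that are trainable relative to the full matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

random.seed(0)
d_in, d_out, r = 8, 6, 2  # toy sizes, hypothetical
w = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
b = [[0.0] * d_out for _ in range(r)]  # zero init: the update starts at zero
x = [[random.gauss(0, 1) for _ in range(d_in)]]

# With b at zero, the adapted layer reproduces the frozen layer exactly.
assert lora_forward(x, w, a, b, alpha=4, r=r) == matmul(x, w)

# Illustrative sizes only: a 4096 x 4096 matrix with rank 8.
print(trainable_fraction(4096, 4096, 8))  # 0.00390625
```

Zero-initializing one of the two factors is the usual LoRA convention, so training starts from the unmodified base model; the printed fraction shows why the approach is cheap for a matrix of that assumed size.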
