Small Languages, Big Models: A Study of Continual Training on Languages of Norway

David Samuel; Vladislav Mikhailov; Erik Velldal; Lilja Øvrelid; Lucas Georges Gabriel Charpentier; Andrey Kutuzov; Stephan Oepen

arXiv:2412.06484·cs.CL·February 4, 2025

Open Access · 2 Models

TL;DR

This paper introduces a three-stage continual training method that improves large language models for less widely spoken languages such as Norwegian and Northern Sámi, and releases NorMistral-11B, an 11.4-billion-parameter model with better downstream performance and inference efficiency on those languages.

Contribution

The paper presents a novel three-stage continual training approach for languages with limited training data and releases a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi.

Findings

01

Substantially improved downstream performance in the target languages.

02

Improved inference efficiency for the target languages.

03

Open release of the NorMistral-11B model.

Abstract

Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.

Taxonomy

Topics: Second Language Learning and Teaching · Higher Education Learning Practices