LLMs for Extremely Low-Resource Finno-Ugric Languages

Taido Purason; Hele-Andra Kuulmets; Mark Fishel

arXiv:2410.18902 · cs.CL · May 6, 2025

TL;DR

This paper develops and evaluates large language models for underrepresented Finno-Ugric languages, covering data collection, model training, and human evaluation to promote linguistic diversity in NLP.

Contribution

The paper introduces multilingual base and instruction-tuned models for Võro, Livonian, and Komi, along with new evaluation benchmarks and human assessments.

Findings

01 Successful creation of multilingual LLMs for low-resource languages

02 Benchmark datasets and human evaluations demonstrate model effectiveness

03 Promotes linguistic diversity in NLP applications

Abstract

The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.

Code & Models

Repositories

tartunlp/smugri-llm (PyTorch, official)

Models

tartuNLP/Llama-SMUGRI-7B-Instruct-MTI

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection