LLMs for Extremely Low-Resource Finno-Ugric Languages
Taido Purason, Hele-Andra Kuulmets, Mark Fishel

TL;DR
This paper develops and evaluates large language models for underrepresented Finno-Ugric languages, covering data collection, model training, and human evaluation to promote linguistic diversity in NLP.
Contribution
It introduces multilingual base and instruction-tuned models for Võro, Livonian, and Komi, along with new evaluation benchmarks and human assessments.
Findings
Successful creation of multilingual LLMs for low-resource languages
Benchmark datasets and human evaluations demonstrate model effectiveness
Promotes linguistic diversity in NLP applications
Abstract
The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on Võro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
