UCCIX: Irish-eXcellence Large Language Model
Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen

TL;DR
This paper introduces UCCIX, an open-source Irish LLM based on Llama 2-13B, developed with a novel low-resource adaptation framework, outperforming larger models on Irish tasks and providing new benchmarking datasets.
Contribution
The paper presents a novel framework for low-resource language adaptation of LLMs and introduces Irish-specific datasets, advancing Irish language AI capabilities.
Findings
UCCIX outperforms larger models on Irish tasks with up to 12% improvement.
The framework requires only a fraction of typical training data for low-resource languages.
New Irish benchmarking datasets enable rigorous evaluation of Irish LLMs.
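To make the "fraction of typical training data" finding concrete, the sketch below estimates how many tokens a compute-optimal scaling heuristic would suggest for a 13B-parameter model and what share of that a small corpus covers. The ratio of roughly 20 tokens per parameter is a commonly cited rule of thumb and is an assumption here, not a figure from the paper. The corpus size is a hypothetical placeholder, and the function names are illustrative only.

```python
# Rough illustration only: the 20-tokens-per-parameter ratio is a widely quoted
# rule of thumb from compute-optimal scaling studies, assumed here for the sketch.
TOKENS_PER_PARAM = 20.0


def compute_optimal_tokens(n_params: float,
                           tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the token budget the rule of thumb suggests for a model size."""
    return n_params * tokens_per_param


def fraction_of_optimal(available_tokens: float, n_params: float) -> float:
    """Return the share of the suggested budget that a corpus covers."""
    return available_tokens / compute_optimal_tokens(n_params)


if __name__ == "__main__":
    model_params = 13e9        # Llama 2-13B, the base model named in the summary
    hypothetical_corpus = 5e8  # placeholder corpus size, NOT taken from the paper

    budget = compute_optimal_tokens(model_params)
    share = fraction_of_optimal(hypothetical_corpus, model_params)
    print(f"Suggested budget: {budget:.3g} tokens")
    print(f"Hypothetical corpus covers {share:.2%} of that budget")
```

Under these assumptions the gap between the suggested budget and a low-resource corpus spans orders of magnitude, which is the motivation for continuing pre-training from an existing model rather than training from scratch.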
Abstract
The development of Large Language Models (LLMs) has predominantly focused on high-resource languages, leaving extremely low-resource languages like Irish with limited representation. This work presents UCCIX, a pioneering effort toward the development of an open-source Irish-based LLM. We propose a novel framework for continued pre-training of LLMs specifically adapted for extremely low-resource languages, requiring only a fraction of the textual data typically needed for training LLMs according to scaling laws. Our model, based on Llama 2-13B, outperforms much larger models on Irish language tasks with up to 12% performance improvement, showcasing the effectiveness and efficiency of our approach. We also contribute comprehensive Irish benchmarking datasets, including IrishQA, a question-answering dataset, and an Irish version of MT-bench. These datasets enable rigorous evaluation and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: LLaMA
