Qomhrá: A Bilingual Irish and English Large Language Model
Joseph McInerney, Khanh-Tung Tran, Liam Lonergan, Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Barry Devereux

TL;DR
This paper presents Qomhrá, a bilingual Irish-English large language model developed under low-resource constraints. It introduces a novel method for synthesizing human preference data and demonstrates significant performance improvements over existing Irish LLMs.
Contribution
The paper introduces a new bilingual Irish-English LLM, Qomhrá, together with a novel method for synthesizing human preference data and a comprehensive evaluation, advancing low-resource language modeling.
Findings
Qomhrá outperforms UCCIX by up to 29% in Irish and 44% in English on benchmarks.
A novel LLM prompting method effectively synthesizes human preference data for low-resource languages.
Gemini-2.5-Pro is identified as the best LLM for Irish language generation among evaluated models.
Abstract
Large language model (LLM) research and development has overwhelmingly focused on the world's major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces Qomhrá, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate "accepted" and "rejected" responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2…
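The preference-data idea in the abstract (prompting an LLM for an "accepted" and a "rejected" response to the same instruction) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt wording, the JSON reply format, the `prompt`/`chosen`/`rejected` record layout, and the names `build_prompt`, `synthesise_pair` and `stub_generate` are hypothetical and are not taken from the paper. The stub stands in for a real model call.

```python
import json
from typing import Callable, Dict

# Hypothetical prompt wording; the paper's actual prompt is not shown here.
PROMPT_TEMPLATE = (
    "You are helping to build preference data for Irish.\n"
    "Instruction: {instruction}\n"
    "Reply with a JSON object with two keys: "
    '"accepted" (a fluent, helpful Irish response) and '
    '"rejected" (a plausible but clearly worse response).'
)


def build_prompt(instruction: str) -> str:
    """Fill the (assumed) template with one instruction."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def synthesise_pair(instruction: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Ask the model for both responses and return one preference record."""
    raw = generate(build_prompt(instruction))
    data = json.loads(raw)  # assumes the model replied with valid JSON
    accepted = data["accepted"].strip()
    rejected = data["rejected"].strip()
    # Discard pairs that carry no preference signal.
    if not accepted or not rejected or accepted == rejected:
        raise ValueError("unusable preference pair")
    return {"prompt": instruction, "chosen": accepted, "rejected": rejected}


def stub_generate(prompt: str) -> str:
    """Stand-in for a real LLM call, returning a fixed JSON reply."""
    reply = {"accepted": "Dia duit! Conas atá tú?", "rejected": "Hello."}
    return json.dumps(reply, ensure_ascii=False)


pair = synthesise_pair("Greet the user in Irish.", stub_generate)
print(pair)
```

Passing the generator in as a callable keeps the sketch independent of any particular model API; in practice `stub_generate` would be replaced by a call to whichever LLM is selected for synthesis, and the resulting pairs would still need validation against speaker judgements, as the abstract describes.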
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
