Conversations in Galician: a Large Language Model for an Underrepresented Language
Eliseo Bao, Anxo Pérez, Javier Parapar

TL;DR
This paper introduces a Galician adaptation of the Alpaca instruction dataset (52,000 instructions and demonstrations) and fine-tunes LLaMA-7B to support Galician, addressing the underrepresentation of low-resource languages in AI.
Contribution
It provides a novel Galician instruction dataset and demonstrates fine-tuning of LLaMA-7B for Galician, enhancing NLP capabilities for low-resource languages.
Findings
Galician dataset improves instruction adherence in models
Fine-tuned LLaMA-7B responds accurately in Galician
Knowledge of Portuguese aids in low-resource language modeling
Abstract
The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration…
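For readers unfamiliar with the Alpaca format mentioned in the abstract, the sketch below shows how one instruction record is typically turned into fine-tuning text. It assumes the three-field record layout (instruction, input, output) and the English prompt wrapper of the original Stanford Alpaca project. Whether the Galician adaptation translates the wrapper text is not stated here, and the example record and the helper names build_prompt and build_training_text are hypothetical illustrations, not taken from the dataset or the authors' code.

```python
# Prompt wrappers as used by the original (English) Stanford Alpaca project.
# Assumption: the Galician adaptation keeps the same three-field structure;
# it may or may not translate this wrapper text.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Return the prompt part of a record, choosing the wrapper by whether 'input' is non-empty."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


def build_training_text(record: dict) -> str:
    """Return prompt plus demonstration, the full text a model would be fine-tuned on."""
    return build_prompt(record) + record["output"]


# Hypothetical record for illustration only (not from the Galician dataset).
example = {
    "instruction": "Traduce ao inglés a seguinte frase.",
    "input": "Bos días.",
    "output": "Good morning.",
}

if __name__ == "__main__":
    print(build_training_text(example))
```

At inference time only build_prompt would be used, and the model would be expected to generate the text that follows the response marker.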
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
