Conversations in Galician: a Large Language Model for an Underrepresented Language

Eliseo Bao; Anxo Pérez; Javier Parapar

arXiv:2311.03812 · cs.CL · November 8, 2023 · 1 citation

Open Access · 1 Repo · 1 Model · 1 Dataset

TL;DR

This paper introduces a Galician adaptation of the Alpaca instruction dataset (52,000 instructions and demonstrations) and fine-tunes a language model to follow instructions in Galician, addressing the underrepresentation of low-resource languages in AI.

Contribution

It provides a novel Galician instruction dataset and demonstrates fine-tuning of LLaMA-7B for Galician, enhancing NLP capabilities for low-resource languages.

Findings

01. Galician dataset improves instruction adherence in models

02. Fine-tuned LLaMA-7B responds accurately in Galician

03. Knowledge of Portuguese aids in low-resource language modeling

Abstract

The recent proliferation of Large Conversation Language Models has highlighted the economic significance of widespread access to this type of AI technology in the current information age. Nevertheless, prevailing models have primarily been trained on corpora consisting of documents written in popular languages. The dearth of such cutting-edge tools for low-resource languages further exacerbates their underrepresentation in the current economic landscape, thereby impacting their native speakers. This paper introduces two novel resources designed to enhance Natural Language Processing (NLP) for the Galician language. We present a Galician adaptation of the Alpaca dataset, comprising 52,000 instructions and demonstrations. This dataset proves invaluable for enhancing language models by fine-tuning them to more accurately adhere to provided instructions. Additionally, as a demonstration…

Code & Models

Repositories

https://gitlab.irlab.org/irlab/cabuxa
Official

Models

🤗 irlab-udc/cabuxa-7b
model · 10 downloads · 6 likes

Datasets

irlab-udc/alpaca_data_galician
dataset · 12 downloads
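
Each record in an Alpaca-style instruction dataset pairs an instruction (plus an optional input) with a demonstration output. The sketch below shows how one such record might be turned into a fine-tuning prompt. It assumes the Galician adaptation keeps the `instruction`, `input`, and `output` field names of the original Alpaca data, and it reuses the English Stanford Alpaca template wording; the template actually used for the Galician model may differ. The sample record is invented for illustration and is not taken from the dataset.

```python
# Templates follow the English Stanford Alpaca wording. Assumption: the
# Galician fine-tuning may use a translated or otherwise different template.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Format one Alpaca-style record into the prompt shown to the model."""
    # Records without extra context carry an empty (or missing) input field.
    if (record.get("input") or "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Hypothetical record, invented for illustration (field names are assumed).
example = {
    "instruction": "Traduce ao inglés a seguinte frase.",
    "input": "Bos días.",
    "output": "Good morning.",
}

prompt = build_prompt(example)
# During supervised fine-tuning the demonstration is appended to the prompt.
full_text = prompt + example["output"]
print(full_text)
```

Keeping the prompt and the demonstration as separate strings makes it straightforward to compute the training loss on the response tokens only, a common choice in instruction tuning.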

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling