GlórIA - A Generative and Open Large Language Model for Portuguese

Ricardo Lopes; João Magalhães; David Semedo

arXiv:2402.12969 · cs.CL · February 21, 2024 · 1 citation

Open Access · 1 dataset

TL;DR

GlórIA is a large, open European Portuguese decoder language model pre-trained on a 35-billion-token PT-PT corpus. It outperforms existing open Portuguese models in language modeling, shows strong potential on downstream tasks, and is released alongside a new Portuguese benchmark, CALAME-PT.

Contribution

The paper presents GlórIA, a new large language model for European Portuguese, along with CALAME-PT, a new benchmark for evaluating Portuguese language models.

Findings

01

GlórIA outperforms existing open Portuguese models in language modeling.

02

It generates coherent, knowledge-rich Portuguese text.

03

The model shows strong potential across various downstream tasks.

Abstract

Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first…

Code & Models

Datasets

Polygl0t/CALAME-PT
dataset · 58 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems