GlórIA -- A Generative and Open Large Language Model for Portuguese
Ricardo Lopes, João Magalhães, David Semedo

TL;DR
GlórIA is a large, open European Portuguese decoder language model pre-trained on a corpus of 35 billion tokens. It outperforms existing open Portuguese models in language modeling, shows potential on downstream tasks, and is accompanied by a new Portuguese benchmark, CALAME-PT.
Contribution
The paper presents GlórIA, a large language model for European Portuguese, along with CALAME-PT, a new benchmark for evaluating the language modeling capabilities of Portuguese models.
Findings
GlórIA outperforms existing open Portuguese (PT) models in language modeling.
It generates coherent, knowledge-rich Portuguese text.
The model shows strong potential across various downstream tasks.
Abstract
Significant strides have been made in natural language tasks, largely attributed to the emergence of powerful large language models (LLMs). These models, pre-trained on extensive and diverse corpora, have become increasingly capable of comprehending the intricacies of language. Despite the abundance of LLMs for many high-resource languages, the availability of such models remains limited for European Portuguese. We introduce GlórIA, a robust European Portuguese decoder LLM. To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources. We present our pre-training methodology, followed by an assessment of the model's effectiveness on multiple downstream tasks. Additionally, to evaluate our models' language modeling capabilities, we introduce CALAME-PT (Context-Aware LAnguage Modeling Evaluation for Portuguese), the first…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
