Fostering the Ecosystem of Open Neural Encoders for Portuguese with Albertina PT* Family
Rodrigo Santos, João Rodrigues, Luís Gomes, João Silva, António Branco, Henrique Lopes Cardoso, Tomás Freitas Osório, Bernardo Leite

TL;DR
This paper expands the ecosystem of open-source neural encoders for Portuguese by introducing larger and more efficient models, along with new datasets, to support research and commercial applications in the language.
Contribution
It presents a new 1.5 billion parameter encoder and a 100 million parameter model, extending the Portuguese language model ecosystem with open, high-performance resources.
Findings
- Introduction of a 1.5B parameter encoder for Portuguese
- Development of an efficient 100M parameter encoder
- Creation of new Portuguese datasets based on SuperGLUE
Abstract
To foster the neural encoding of Portuguese, this paper contributes foundation encoder models that represent an expansion of the still very scarce ecosystem of large language models specifically developed for this language that are fully open, in the sense that they are open source and openly distributed for free under an open license for any purpose, thus including research and commercial usages. Like most languages other than English, Portuguese is low-resourced in terms of these foundational language resources, there being the inaugural 900 million parameter Albertina and 335 million Bertimbau. Taking this couple of models as an inaugural set, we present the extension of the ecosystem of state-of-the-art open encoders for Portuguese with a larger, top performance-driven model with 1.5 billion parameters, and a smaller, efficiency-driven model with 100 million parameters. While…
Code & Models
- 🤗 PORTULAN/albertina-100m-portuguese-ptpt-encoder (model) · 216 dl · ♡ 4
- 🤗 PORTULAN/albertina-100m-portuguese-ptbr-encoder (model) · 283 dl · ♡ 7
- 🤗 PORTULAN/albertina-1b5-portuguese-ptpt-encoder (model) · 78 dl · ♡ 9
- 🤗 PORTULAN/albertina-1b5-portuguese-ptbr-encoder (model) · 87 dl · ♡ 5
- 🤗 PORTULAN/albertina-1b5-portuguese-ptbr-encoder-256 (model) · 1 dl · ♡ 2
- 🤗 PORTULAN/albertina-1b5-portuguese-ptpt-encoder-256 (model) · 1 dl · ♡ 2
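
As a rough illustration of how one of the repositories listed above might be tried out, the following sketch builds a repository identifier from a model size and language variant and passes it to the Hugging Face `transformers` fill-mask pipeline. The helper names `albertina_repo_id` and `fill_mask` are hypothetical, and the sketch assumes the checkpoints ship a masked-language-modelling head that the `fill-mask` pipeline can load; it is not taken from the paper. The `-256` repositories are not covered by the helper.

```python
# Minimal sketch, not the authors' code. Repository names follow the
# pattern of the model list above; the helper names are hypothetical.

SIZES = ("100m", "1b5")      # 100M and 1.5B parameter encoders
VARIANTS = ("ptpt", "ptbr")  # European and Brazilian Portuguese


def albertina_repo_id(size: str = "100m", variant: str = "ptpt") -> str:
    """Return the Hugging Face repository id for a size/variant pair."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}, expected one of {VARIANTS}")
    return f"PORTULAN/albertina-{size}-portuguese-{variant}-encoder"


def fill_mask(template: str, size: str = "100m", variant: str = "ptpt", top_k: int = 3):
    """Run the fill-mask pipeline on `template`, which must contain '{mask}'.

    Assumption: the checkpoint is usable with the 'fill-mask' pipeline.
    Calling this downloads the model weights from the Hugging Face Hub.
    """
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=albertina_repo_id(size, variant))
    # Use the tokenizer's own mask token rather than hard-coding one.
    text = template.format(mask=unmasker.tokenizer.mask_token)
    return unmasker(text, top_k=top_k)
```

A call such as `fill_mask("A culinária portuguesa é rica em sabores e {mask}.")` would return a list of candidate completions, each a dictionary with keys including `token_str` and `score`. Starting with the 100M model keeps the download and memory footprint small; the 1.5B model can be selected with `size="1b5"`.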
Taxonomy
Topics: Neural Networks and Applications
