Tucano: Advancing Neural Text Generation for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah

TL;DR
This paper introduces Tucano, a series of Portuguese language models trained on a large, deduplicated corpus, demonstrating competitive performance and highlighting evaluation limitations in Portuguese NLP.
Contribution
The study presents GigaVerbo, a deduplicated Portuguese text corpus of 200 billion tokens, and uses it to train the Tucano models, which match or outperform similarly sized models, advancing neural text generation for Portuguese.
Findings
Tucano models perform as well as or better than comparable models on Portuguese benchmarks.
Evaluation metrics show limited correlation with training data scale.
The models and supporting resources are released as open source for community use.
Abstract
Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform equal or superior…
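The abstract describes GigaVerbo as a concatenation of deduplicated corpora but, as quoted here, does not say how the deduplication was done. As a rough illustration only, the sketch below shows one common approach, exact-match deduplication by hashing normalized text. The function names `normalize` and `deduplicate`, the normalization rules, and the sample sentences are all assumptions made for this example, not the authors' pipeline.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize text so trivially different copies hash identically.

    Assumed rules for this sketch: Unicode NFC, lowercase, collapsed whitespace.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def deduplicate(docs):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample: the first two entries differ only in case and spacing.
corpus = [
    "O tucano é uma ave.",
    "o tucano  é uma ave.",
    "GigaVerbo é um corpus.",
]
unique_docs = deduplicate(corpus)
```

Exact hashing only removes copies that are identical after normalization. Near-duplicate detection (for example MinHash) is a different technique and is not shown here.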
Code & Models
- 🤗 TucanoBR/XGBClassifier-text-filter (model)
- 🤗 TucanoBR/XGBRegressor-text-filter (model)
- 🤗 TucanoBR/BERTimbau-large-text-filter (model) · 4 dl
- 🤗 TucanoBR/BERTimbau-base-text-filter (model) · 22 dl
- 🤗 TucanoBR/Tucano-160m (model) · 2.7k dl · ♡ 4
- 🤗 TucanoBR/Tucano-630m (model) · 234 dl · ♡ 4
- 🤗 TucanoBR/Tucano-1b1 (model) · 623 dl · ♡ 3
- 🤗 TucanoBR/Tucano-1b1-Instruct (model) · 265 dl · ♡ 5
- 🤗 TucanoBR/Tucano-2b4 (model) · 254 dl · ♡ 5
- 🤗 TucanoBR/Tucano-2b4-Instruct (model) · 540 dl · ♡ 7
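The checkpoints listed above are hosted on the Hugging Face Hub, so they can presumably be loaded with the `transformers` library's generic auto classes. The sketch below is a minimal illustration under that assumption. It also assumes `transformers` and PyTorch are installed. The helper name `generate` and the default `max_new_tokens` value are hypothetical choices for this example, not settings recommended by the authors.

```python
# Repository id taken from the list above; any other listed Tucano model id
# could be substituted.
MODEL_ID = "TucanoBR/Tucano-160m"


def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Load the model and return a continuation of `prompt`.

    Hypothetical helper: weights are downloaded on first use, and the import
    is deferred so that defining this function has no side effects.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize to PyTorch tensors and decode the generated token ids to text.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("A floresta amazônica é conhecida por")` would then return the prompt followed by the model's continuation. The exact text depends on the checkpoint and the decoding settings.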
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: Sparse Evolutionary Training
