Tucano 2 Cool: Better Open Source LLMs for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner, Lucie Flek

TL;DR
Tucano 2 is an open-source suite of Portuguese LLMs with improved datasets, training recipes, and evaluation methods, achieving state-of-the-art performance and fostering community access and reproducibility.
Contribution
We introduce Tucano 2, a comprehensive open-source Portuguese LLM suite with enhanced datasets, training procedures, and evaluation tools, setting new performance standards in the field.
Findings
Achieved state-of-the-art results on Portuguese benchmarks.
Developed new datasets and training recipes for better domain adaptation.
Provided open access to all artifacts for community use.
Abstract
We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5-3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, to a new degree of quality and scale. We also introduce a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains such as retrieval-augmented generation, coding, tool use, chain-of-thought reasoning, and many other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling…
Code & Models
- 🤗 Polygl0t/portuguese-bertimbau-edu-classifier (model · 14 downloads)
- 🤗 Polygl0t/portuguese-bertimbau-large-edu-classifier (model · 14 downloads)
- 🤗 Polygl0t/portuguese-bertabaporu-large-toxicity-classifier (model · 14 downloads)
- 🤗 Polygl0t/portuguese-bertimbau-toxicity-classifier (model · 16 downloads)
- 🤗 Polygl0t/portuguese-qwen3-4b-instruct-quality-classifier (model · 21 downloads)
- 🤗 Polygl0t/portuguese-qwen3-4b-instruct-quality-judge (model · 20 downloads)
- 🤗 Polygl0t/GigaVerbo-v2-ablation-EDU-Synth-1.5B (model · 19 downloads)
- 🤗 Polygl0t/GigaVerbo-v2-ablation-NonEDU-1.5B (model · 16 downloads)
- 🤗 Polygl0t/GigaVerbo-v2-ablation-EDU-1.5B (model · 17 downloads)
- 🤗 Polygl0t/Tucano2-0.6B-Base (model · 44 downloads)
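For readers who want to try one of the listed checkpoints, the following is a minimal sketch of how the base model might be loaded with the Hugging Face `transformers` library. It assumes, without confirmation from this page, that `Polygl0t/Tucano2-0.6B-Base` is published as a standard causal language model that works with `AutoModelForCausalLM` and `AutoTokenizer`; the helper names `hub_url` and `generate` and the example prompt are illustrative and do not come from the paper.

```python
# Repository ID copied from the model list above.
MODEL_ID = "Polygl0t/Tucano2-0.6B-Base"


def hub_url(repo_id: str) -> str:
    """Return the Hugging Face Hub page for a repository ID (hypothetical helper)."""
    return f"https://huggingface.co/{repo_id}"


def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Generate a continuation for `prompt` (hypothetical helper).

    Assumption: the checkpoint loads as a standard causal LM. The imports are
    kept inside the function so that defining it does not require
    `transformers` or `torch` to be installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Tokenize the prompt into PyTorch tensors and decode the generated tokens.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("A capital do Brasil é")` would download the weights on first use and return the prompt followed by the model's continuation. Because this is a base (not instruction-tuned) checkpoint, plain text completion is likely a better fit than chat-style prompting.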
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
