Sabiá-4 Technical Report
Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau, Celio Larcher, Ramon Pires, Rodrigo Nogueira

TL;DR
Sabiá-4 and Sabiazinho-4 are Portuguese language models focused on Brazilian Portuguese. They are trained with a four-stage pipeline and evaluated on conversational, legal, long-context, instruction-following, exam, and agentic benchmarks, where they achieve a favorable cost-performance trade-off compared to other models.
Contribution
Introduces the Sabiá-4 and Sabiazinho-4 models, trained with a four-stage pipeline tailored to Brazilian Portuguese, and benchmarks them across six categories.
Findings
Achieve favorable cost-performance trade-off
Improve legal document drafting accuracy
Enhance multi-turn dialogue quality
Abstract
This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
