NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
Enzo S. N. Silva, Pablo B. Costa, Raphael C. Vlasman, Rosimeire P. Costa, Henrique L. P. Silva, Lucas F. A. O. Pellicer, Guilherme Rinaldo, Renato A. Almeida, Darian S. R. Rabbani, Cinthya O. Oestreich, Vinicius F. Caridá

TL;DR
NorBERTo is a new Portuguese encoder model trained on Aurora-PT, a 331-billion-token corpus presented as the largest openly available for Portuguese. It achieves the best results among the evaluated encoder models on PLUE tasks such as MRPC and RTE and is designed for practical deployment.
Contribution
Introduces NorBERTo, an encoder based on the ModernBERT architecture and trained on Aurora-PT, a newly curated Brazilian Portuguese corpus of 331 billion GPT-2 tokens, with improved benchmark performance and deployment efficiency.
Findings
NorBERTo-large achieves 0.9191 F1 on the PLUE MRPC task.
NorBERTo-large attains 0.7689 accuracy on the PLUE RTE task.
Aurora-PT surpasses previous Portuguese corpora in size.
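For readers less familiar with the two metrics quoted in the findings above, the following is a minimal sketch of how binary F1 and accuracy are computed from gold labels and predictions. It is not the paper's evaluation code: the function names and the toy labels are invented for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = paraphrase / entailment, 0 = not.
gold = [1, 1, 1, 0, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1]

toy_f1 = f1_score(gold, pred)
toy_acc = accuracy(gold, pred)
print(f"F1 = {toy_f1:.4f}, accuracy = {toy_acc:.4f}")
```

F1 balances precision against recall on the positive class, whereas accuracy counts every correct prediction equally, so the two can diverge when the classes are imbalanced.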
Abstract
High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest…
