PoETa v2: Toward More Robust Evaluation of Large Language Models in Portuguese
Thales Sales Almeida, Ramon Pires, Hugo Abonizio, Rodrigo Nogueira, Hélio Pedrini

TL;DR
PoETa v2 is a comprehensive benchmark that evaluates large language models in Portuguese on more than 40 tasks, showing how computational investment and language-specific adaptation affect performance.
Contribution
This work introduces PoETa v2, a suite of over 40 Portuguese tasks, and uses it for what the authors describe as the most extensive evaluation of LLMs in Portuguese to date, enabling systematic analysis of model performance and of gaps relative to English.
Findings
Performance varies significantly with computational investment.
Language-specific adaptation impacts model effectiveness.
The study identifies performance gaps between Portuguese tasks and their English equivalents.
Abstract
Large Language Models (LLMs) exhibit significant variations in performance across linguistic and cultural contexts, underscoring the need for systematic evaluation in diverse languages. In this work, we present the most extensive evaluation of LLMs for the Portuguese language to date. Leveraging our newly introduced PoETa v2 benchmark, a comprehensive suite of over 40 tasks in Portuguese, we assess more than 20 models covering a broad spectrum of training scales and computational resources. Our study reveals how computational investment and language-specific adaptation impact performance in Portuguese, while also analyzing performance gaps in comparison to equivalent tasks in English. Through this benchmark and analysis, PoETa v2 lays the groundwork for future research on Portuguese language modeling and evaluation. The benchmark is available at https://github.com/PoETaV2/PoETaV2.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
