Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?
Henrique Godoy

TL;DR
This paper introduces Alvorada-Bench, a benchmark built from Brazilian university entrance exams, and uses it to evaluate language models' accuracy, reasoning, and self-assessment across subjects and prompting methods.
Contribution
It provides the first large-scale, culturally relevant benchmark for Brazilian exams, analyzing model performance, reasoning skills, and self-confidence calibration in a real-world educational context.
Findings
Top models exceed 94% accuracy overall
Accuracy drops on Mathematics and on the engineering-oriented IME and ITA exams
Self-reported confidence is well calibrated and correlates with perceived difficulty
Abstract
Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. Twenty models are evaluated under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, indicating that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved…
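The response count in the abstract follows from the evaluation grid: 4,515 questions, twenty models, and three prompting styles. The sketch below checks that arithmetic and shows one common way to quantify "well calibrated", a binned expected calibration error. The function names, the ten-bin choice, and the assumption that confidence is a number in [0, 1] are illustrative and not taken from the paper, which may measure calibration differently.

```python
def total_responses(n_questions: int, n_models: int, n_prompt_styles: int) -> int:
    """Size of a full evaluation grid: every model answers every question
    once under every prompting style."""
    return n_questions * n_models * n_prompt_styles


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error (hypothetical helper).

    confidences: self-reported confidence per response, assumed in [0, 1].
    correct:     booleans, True where the response was right.
    Returns the weighted mean gap between average confidence and accuracy
    across equal-width confidence bins.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Figures from the abstract: 4,515 questions, 20 models, 3 prompting styles.
print(total_responses(4515, 20, 3))
```

A value near zero from `expected_calibration_error` would mean that stated confidence tracks observed accuracy closely; larger values indicate over- or under-confidence.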
