Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

Henrique Godoy

arXiv:2508.15835·cs.CL·August 25, 2025

TL;DR

This paper introduces Alvorada-Bench, a comprehensive Brazilian university entrance exam benchmark, evaluating language models' accuracy, reasoning, and self-assessment capabilities across diverse subjects and prompting methods.

Contribution

It provides the first large-scale, culturally relevant benchmark for Brazilian exams, analyzing model performance, reasoning skills, and self-confidence calibration in a real-world educational context.

Findings

01 Top models exceed 94% accuracy overall

02 Accuracy drops on Mathematics and engineering exams

03 Models can reliably assess their own confidence

Abstract

Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. Twenty models are evaluated under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, indicating that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved…
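The response count in the abstract can be checked with simple arithmetic. The sketch below assumes that every model answers every question exactly once under each prompting method; the function name `expected_responses` is hypothetical and used only for illustration.

```python
# Figures taken from the abstract.
NUM_QUESTIONS = 4515
NUM_MODELS = 20
PROMPTING_METHODS = ("zero-shot", "role-playing", "chain-of-thought")


def expected_responses(questions: int, models: int, methods: int) -> int:
    """Assumption: one response per (question, model, prompting method) triple."""
    return questions * models * methods


total = expected_responses(NUM_QUESTIONS, NUM_MODELS, len(PROMPTING_METHODS))
print(total)  # 270900
```

Under that assumption the product matches the 270,900 responses reported in the abstract.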

Code & Models

Datasets

HenriqueGodoy/Alvorada-bench
dataset· 19 dl
