Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Ramon Pires, Thales Sales Almeida, Celio Larcher Junior, Giovana Bonás, Hugo Abonizio, Marcos Piau, Roseval Malaquias Junior, Thiago Laitz, Rodrigo Nogueira

TL;DR
Magis-Bench is a new benchmark assessing large language models on magistrate-level legal tasks from Brazilian exams, revealing current models' limitations in judicial reasoning and writing.
Contribution
The paper introduces Magis-Bench, a comprehensive legal benchmark with evaluation methodology and results for state-of-the-art LLMs on judicial tasks.
Findings
Google's Gemini-3-Pro-Preview scored highest at 6.97/10.
Even the top-scoring model stayed below 70% of the maximum score, leaving substantial room for improvement.
High agreement among the LLM judges indicates that the evaluation is reliable.
Abstract
Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators.…
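The abstract describes scoring with several independent LLM judges and reports high inter-judge agreement. The sketch below illustrates one plausible way such scores could be combined and compared. It is not the paper's method: averaging across judges, using Pearson correlation as the agreement measure, the function names, and the judge labels and scores in the usage note are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    # Assumes neither judge gave the same score to every question
    # (zero variance would make the correlation undefined).
    return cov / (var_x * var_y) ** 0.5


def aggregate_score(scores_by_judge):
    """Average each question over the judges, then average over questions.

    scores_by_judge maps a judge name to a list of 0-10 scores,
    aligned so that index i refers to the same question for every judge.
    """
    judges = list(scores_by_judge)
    n_questions = len(scores_by_judge[judges[0]])
    per_question = [
        mean(scores_by_judge[j][i] for j in judges) for i in range(n_questions)
    ]
    return mean(per_question)


def mean_pairwise_agreement(scores_by_judge):
    """Mean Pearson correlation over all pairs of judges."""
    pairs = combinations(scores_by_judge.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)
```

For example, with hypothetical scores `{"judge_a": [6, 8], "judge_b": [8, 6]}`, `aggregate_score` averages each question across the two judges before averaging across questions. Correlation is only one possible agreement measure; a rank-based statistic or an intraclass correlation could be substituted without changing the overall structure.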
