Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Ramon Pires; Thales Sales Almeida; Celio Larcher Junior; Giovana Bonás; Hugo Abonizio; Marcos Piau; Roseval Malaquias Junior; Thiago Laitz; Rodrigo Nogueira

arXiv:2605.08437·cs.CL·May 12, 2026

TL;DR

Magis-Bench is a new benchmark assessing large language models on magistrate-level legal tasks from Brazilian exams, revealing current models' limitations in judicial reasoning and writing.

Contribution

The paper introduces Magis-Bench, a comprehensive legal benchmark with evaluation methodology and results for state-of-the-art LLMs on judicial tasks.

Findings

01

Google's Gemini-3-Pro-Preview scored highest at 6.97/10.

02

Every evaluated model scored below 70% of the maximum score, leaving substantial room for improvement.

03

High agreement among the LLM judges indicates that the evaluation is reliable.

Abstract

Existing benchmarks for legal AI focus primarily on tasks where LLMs must produce legal arguments or documents, yet the capacity to judge such arguments (weighing competing claims, applying doctrine to facts, and rendering reasoned decisions) is arguably as fundamental to a well-functioning legal system as advocacy itself. We introduce Magis-Bench, a benchmark for evaluating LLMs on magistrate-level writing tasks derived from recent Brazilian competitive examinations for judicial positions. Magis-Bench comprises 74 questions from eight examinations conducted between 2023 and 2025, including discursive legal analysis questions with multi-turn structure and practical exercises requiring the composition of complete civil and criminal judicial sentences. We evaluate 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with four independent frontier models as evaluators.…
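The abstract is truncated before it describes how the judges' scores are combined, so the following is only a minimal sketch of one common way to handle a multi-judge setup. It assumes each judge assigns every model a score on a 0-10 scale, averages those scores per model, and measures inter-judge agreement as the mean pairwise Pearson correlation. The function names, the data layout, and the choice of Pearson correlation are all assumptions for illustration, not the paper's stated method.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def aggregate_scores(scores_by_judge):
    """Average each model's score across all judges (assumed aggregation rule)."""
    models = next(iter(scores_by_judge.values())).keys()
    return {
        model: mean(judge_scores[model] for judge_scores in scores_by_judge.values())
        for model in models
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_agreement(scores_by_judge):
    """Mean Pearson correlation over all pairs of judges (assumed agreement metric)."""
    models = sorted(next(iter(scores_by_judge.values())))
    correlations = [
        pearson(
            [scores_by_judge[a][m] for m in models],
            [scores_by_judge[b][m] for m in models],
        )
        for a, b in combinations(scores_by_judge, 2)
    ]
    return mean(correlations)


# Hypothetical scores: judge and model names are placeholders, not paper data.
example_scores = {
    "judge_a": {"model_x": 7.0, "model_y": 5.0, "model_z": 3.0},
    "judge_b": {"model_x": 8.0, "model_y": 6.0, "model_z": 4.0},
}
example_ranking = aggregate_scores(example_scores)
example_agreement = mean_pairwise_agreement(example_scores)
```

Averaging across judges dampens the bias of any single evaluator. A correlation-based agreement measure checks whether judges rank models consistently, even when one judge scores systematically higher than another.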
