GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

Odysseas S. Chlapanis; Dimitrios Galanis; Nikolaos Aletras; Ion Androutsopoulos

arXiv:2505.17267·cs.CL·November 4, 2025



TL;DR

GreekBarBench is a benchmark that tests large language models on free-text legal questions from the Greek Bar exams, requiring answers to cite statutory articles and case facts. The best models outperform average expert scores but fall short of the 95th percentile of experts.

Contribution

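Introduces GreekBarBench, a legal reasoning benchmark spanning five legal areas, together with a three-dimensional scoring system, an LLM-as-a-judge approach, and a meta-evaluation benchmark that measures how well LLM judges correlate with human expert evaluations.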

Findings

01

The best models outperform average expert scores but fall short of the 95th percentile of experts

02

Simple, span-based rubrics improve the alignment between LLM judges and human experts

03

A systematic evaluation of 13 proprietary and open-weight LLMs shows a remaining gap to top human experts

Abstract

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
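
The meta-evaluation idea described above, checking how closely an LLM judge tracks human expert scores, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not name the three scoring dimensions, the aggregation rule, or the correlation statistic, so this sketch uses three unnamed per-dimension scores, a plain mean as the overall score, and Spearman's rank correlation. The score values are invented, not taken from the benchmark.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def overall_score(dimension_scores):
    """Hypothetical aggregation: mean of the three per-dimension scores."""
    return mean(dimension_scores)


# Invented scores for four answers; each tuple holds three dimension scores.
human = [(8, 7, 9), (4, 5, 3), (6, 6, 7), (2, 3, 2)]
judge = [(7, 8, 8), (5, 4, 4), (6, 7, 6), (3, 2, 3)]

human_overall = [overall_score(s) for s in human]
judge_overall = [overall_score(s) for s in judge]
rho = spearman(human_overall, judge_overall)
print(f"Spearman correlation between judge and human scores: {rho:.3f}")
```

A rank-based statistic is used here because it only asks whether the judge orders answers the same way the experts do, not whether the two use the scale identically; the paper's own choice of metric may differ.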


Code & Models

Datasets

AUEB-NLP/greek-bar-bench
dataset · 62 downloads

Videos

GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations· underline