GradeLegal: Automated Grading for German Legal Cases
Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic

TL;DR
This study evaluates the potential of large language models to automate grading of German legal exam solutions, aiming to improve scalability and consistency in a high-stakes educational context.
Contribution
It systematically benchmarks various LLMs and prompting strategies for legal exam grading, highlighting effective approaches and the importance of prompt design.
Findings
Reasoning-oriented LLMs achieve up to 0.91 QWK in public law grading.
Ensembling models improves agreement by up to 0.15 over individual models.
Prompt design and model selection are crucial for reliable grading.
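The findings above are reported in quadratic weighted kappa (QWK), an agreement measure that penalizes a disagreement by the squared distance between the two grades. The sketch below shows how QWK could be computed between expert grades and model grades. It is not the authors' evaluation code. The function name and the 0 to 18 point scale in the example are assumptions chosen for illustration.

```python
def quadratic_weighted_kappa(expert, model, min_grade, max_grade):
    """Illustrative QWK between two lists of integer grades (hypothetical helper)."""
    if len(expert) != len(model) or not expert:
        raise ValueError("grade lists must be non-empty and of equal length")
    if max_grade <= min_grade:
        raise ValueError("the scale needs at least two distinct grades")

    k = max_grade - min_grade + 1  # number of grade categories
    n = len(expert)

    # Observed confusion matrix: rows are expert grades, columns are model grades.
    observed = [[0] * k for _ in range(k)]
    for e, m in zip(expert, model):
        observed[e - min_grade][m - min_grade] += 1

    # Marginal histograms, used to build the chance-agreement matrix.
    hist_expert = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_expert[i] * hist_model[j] / n

    if weighted_expected == 0:
        # Both raters used one and the same grade throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Assumed 0-18 point scale, used here only as an example.
perfect = quadratic_weighted_kappa([0, 9, 18], [0, 9, 18], 0, 18)
opposite = quadratic_weighted_kappa([0, 18], [18, 0], 0, 18)
```

A value of 1 means the two graders agree exactly, 0 means agreement no better than chance, and negative values mean systematic disagreement. Because the penalty grows quadratically, a model that is off by one point is punished far less than one that is off by several.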
Abstract
Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, which delays feedback and creates a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, the literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert…
