BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Sebastian Nagl; Matthias Grabmair

arXiv:2604.13583·cs.CL·April 23, 2026

TL;DR

BenGER is an open-source web platform designed to streamline and improve the benchmarking of German legal language models through collaborative, transparent, and configurable workflows.

Contribution

BenGER introduces a single integrated platform that covers task creation, annotation, model evaluation, and analysis, tailored to legal AI benchmarking.

Findings

01 Supports multi-organization projects with tenant isolation and role-based access control

02 Enables end-to-end benchmarking, from task creation through analysis

03 Provides lexical, semantic, factual, and judge-based evaluation metrics

Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
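To make the "lexical" metric family mentioned in the abstract concrete, here is a minimal sketch of one common lexical measure, token-overlap F1 between a model answer and a reference answer. It is an illustration only: the abstract does not say which lexical metrics BenGER implements, and the function name `token_f1` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer.

    Hypothetical helper for illustration; not taken from BenGER.
    Tokenization is a simple lowercase whitespace split (an assumption).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two German answers that differ in one of four tokens.
    print(token_f1("der Vertrag ist nichtig", "der Vertrag ist wirksam"))
```

The example pair also shows why a lexical score alone is not enough for legal answers: "nichtig" (void) and "wirksam" (valid) are opposite conclusions, yet the two sentences share most of their tokens. That gap is one motivation for combining lexical metrics with the semantic, factual, and judge-based metrics the abstract lists.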
