BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
Sebastian Nagl, Matthias Grabmair

TL;DR
BenGER is an open-source web platform for benchmarking large language models on German legal tasks through collaborative, transparent, and configurable workflows.
Contribution
BenGER integrates task creation, annotation, model evaluation, and analysis in a single platform tailored to legal AI benchmarking.
Findings
Supports multi-organization projects with role-based access
Enables end-to-end benchmarking from task creation to analysis
Provides lexical, semantic, factual, and judge-based evaluation metrics
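To make the first finding concrete, the following is a minimal sketch of how tenant isolation and role-based access control can be combined in one check. It is not taken from BenGER's code base: the role names, the permission table, and the `Membership`, `Project`, and `is_allowed` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical roles; the platform's actual role set is not given in the source.
    ANNOTATOR = "annotator"
    ADMIN = "admin"


# Hypothetical mapping from role to the actions it may perform.
PERMISSIONS = {
    Role.ANNOTATOR: {"annotate"},
    Role.ADMIN: {"annotate", "create_task", "run_evaluation"},
}


@dataclass(frozen=True)
class Membership:
    user: str
    org_id: str
    role: Role


@dataclass(frozen=True)
class Project:
    name: str
    org_id: str


def is_allowed(membership: Membership, project: Project, action: str) -> bool:
    """Return True only if the user belongs to the project's organization
    (tenant isolation) and the user's role grants the action (RBAC)."""
    if membership.org_id != project.org_id:
        return False
    return action in PERMISSIONS[membership.role]
```

Checking the organization before the role means that a user with an administrative role in one organization still gains no access to another organization's projects.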
Abstract
Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
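As an illustration of what a lexical metric measures, the sketch below computes a token-overlap F1 score between a model answer and a reference answer. This is a generic, widely used lexical measure and an assumption on our part: the abstract does not say which lexical metrics BenGER implements, and the function name `token_f1` and the whitespace tokenization are illustrative choices.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score like this rewards shared wording but ignores meaning: "der Vertrag ist wirksam" and "der Vertrag ist nichtig" share three of four tokens while stating opposite legal conclusions. That limitation is one reason to pair lexical metrics with semantic, factual, and judge-based ones.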
