Agent Benchmarks Fail Public Sector Requirements

Jonathan Rystrøm; Chris Schmitz; Karolina Korgul; Jan Batzner; Chris Russell

arXiv:2601.20617·cs.CY·January 29, 2026

Open Access

TL;DR

This paper evaluates existing benchmarks for LLM agents against public-sector requirements and finds that none fully meets the criteria of being process-based, realistic, and public-sector-specific while reporting metrics that reflect public-sector needs. It urges the development of better benchmarks.

Contribution

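The paper derives criteria for public-sector benchmarks from the public administration literature and screens more than 1,300 benchmark papers against them using an expert-validated, LLM-assisted pipeline, exposing the gap and the need for sector-specific evaluation tools.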

Findings

01. No benchmark fully meets all criteria

02. Existing benchmarks lack public-sector-specific metrics

03. Call for development of new, relevant benchmarks

Abstract

Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, public-sector-specific and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all…

Taxonomy

Topics: Ethics and Social Impacts of AI · Multi-Agent Systems and Negotiation · E-Government and Public Services