Agent Benchmarks Fail Public Sector Requirements
Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, Chris Russell

TL;DR
This paper evaluates existing benchmarks for LLM agents against public-sector requirements and finds that none is at once process-based, realistic, public-sector-specific, and equipped with metrics that reflect public-sector needs. It urges the development of better benchmarks.
Contribution
It derives four criteria for public-sector benchmarks from the public administration literature and screens more than 1,300 benchmark papers against them, exposing the gap and the need for sector-specific evaluation tools.
Findings
No benchmark fully meets all criteria
Existing benchmarks lack public-sector-specific metrics
Call for development of new, relevant benchmarks
Abstract
Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, public-sector-specific and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all…
Taxonomy
Topics: Ethics and Social Impacts of AI · Multi-Agent Systems and Negotiation · E-Government and Public Services
