The CitizenQuery Benchmark: A Novel Dataset and Evaluation Pipeline for Measuring LLM Performance in Citizen Query Tasks

Neil Majithia; Rajat Shinde; Zo Chapman; Prajun Trital; Jordan Decker; Manil Maskey; Elena Simperl; Nigel Shadbolt

arXiv:2602.04064 · cs.CY · February 5, 2026

Open Access

TL;DR

This paper introduces CitizenQuery-UK, a large benchmark dataset for evaluating LLMs on citizen queries about government services, focusing on factuality, trustworthiness, and communication quality.

Contribution

It presents a new dataset and evaluation pipeline for measuring LLM performance in citizen query tasks, emphasizing trustworthiness and reliability in public sector applications.

Findings

01

Models show distinct performance profiles with high variance.

02

Abstention rates are low and verbosity is high, both of which affect reliability.

03

Trustworthiness requires acknowledging model fallibility.

Abstract

"Citizen queries" are questions asked by an individual about government policies, guidance, and services that are relevant to their circumstances, encompassing a range of topics including benefits, taxes, immigration, employment, public health, and more. This represents a compelling use case for Large Language Models (LLMs) that respond to citizen queries with information that is adapted to a user's context and communicated according to their needs. However, in this use case, any misinformation could have severe, negative, likely invisible ramifications for an individual placing their trust in a model's response. To this effect, we introduce CitizenQuery-UK, a benchmark dataset of 22 thousand pairs of citizen queries and responses that have been synthetically generated from the swathes of public information on gov.uk about government in the UK. We present the curation methodology…

Taxonomy

Topics: Ethics and Social Impacts of AI · E-Government and Public Services · Mobile Crowdsensing and Crowdsourcing