POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios

Tingyue Yang; Junchi Yao; Yuhui Guo; Chang Liu

arXiv:2511.04705·cs.CL·November 10, 2025

TL;DR

POLIS-Bench is a comprehensive evaluation suite for bilingual policy tasks in government scenarios, featuring an extensive corpus, scenario-grounded tasks, and a dual-metric framework, enabling better assessment and development of LLMs for policy applications.

Contribution

The paper introduces POLIS-Bench, the first systematic benchmark for bilingual governmental policy tasks, with new datasets, task design, and evaluation metrics, facilitating improved LLM assessment and fine-tuning.

Findings

1. Reasoning models show higher stability and accuracy across tasks.

2. Large-scale evaluation reveals a performance hierarchy among LLMs.

3. Fine-tuned lightweight models match or surpass proprietary baselines.

Abstract

We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale…
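
As a rough illustration of what a dual-metric score of this kind might look like, the sketch below reports a semantic similarity value alongside an accuracy rate. It is not the authors' implementation: the abstract does not specify how either metric is computed, so the bag-of-words cosine similarity (a stand-in for an embedding-based measure), the case-insensitive exact-match accuracy, and the function names `cosine_similarity`, `accuracy_rate`, and `dual_metric` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity.

    Assumption: a simple stand-in for whatever semantic similarity
    measure the benchmark actually uses.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def accuracy_rate(predicted: list[str], gold: list[str]) -> float:
    """Fraction of predictions matching the gold label (case-insensitive).

    Assumption: exact label match, e.g. for compliance verdicts.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predicted, gold))
    return hits / len(gold)


def dual_metric(answers: list[str], references: list[str],
                labels: list[str], gold_labels: list[str]) -> dict:
    """Report both scores side by side rather than merging them."""
    sims = [cosine_similarity(a, r) for a, r in zip(answers, references)]
    return {
        "semantic_similarity": sum(sims) / len(sims) if sims else 0.0,
        "accuracy_rate": accuracy_rate(labels, gold_labels),
    }
```

Keeping the two numbers separate reflects the abstract's distinction between content alignment (similarity) and task requirement adherence (accuracy); how the benchmark weighs or aggregates them is not stated in the text above.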

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Text Readability and Simplification