POLIS-Bench: Towards Multi-Dimensional Evaluation of LLMs for Bilingual Policy Tasks in Governmental Scenarios
Tingyue Yang, Junchi Yao, Yuhui Guo, Chang Liu

TL;DR
POLIS-Bench is a comprehensive evaluation suite for bilingual policy tasks in government scenarios, featuring an extensive corpus, scenario-grounded tasks, and a dual-metric framework, enabling better assessment and development of LLMs for policy applications.
Contribution
The paper introduces POLIS-Bench, the first systematic benchmark for bilingual governmental policy tasks, with new datasets, task design, and evaluation metrics, facilitating improved LLM assessment and fine-tuning.
Findings
Reasoning models show higher stability and accuracy across tasks.
Large-scale evaluation reveals performance hierarchy among LLMs.
Fine-tuned lightweight models match or surpass proprietary baselines.
Abstract
We introduce POLIS-Bench, the first rigorous, systematic evaluation suite designed for LLMs operating in governmental bilingual policy scenarios. Compared to existing benchmarks, POLIS-Bench introduces three major advancements. (i) Up-to-date Bilingual Corpus: We construct an extensive, up-to-date policy corpus that significantly scales the effective assessment sample size, ensuring relevance to current governance practice. (ii) Scenario-Grounded Task Design: We distill three specialized, scenario-grounded tasks -- Clause Retrieval & Interpretation, Solution Generation, and Compliance Judgment -- to comprehensively probe model understanding and application. (iii) Dual-Metric Evaluation Framework: We establish a novel dual-metric evaluation framework combining semantic similarity with accuracy rate to precisely measure both content alignment and task requirement adherence. A large-scale…
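The dual-metric idea in point (iii) can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words cosine below is only a crude stand-in for whatever semantic similarity measure the authors use, and the per-item boolean judgments (whether a response met the task requirement) are assumed to come from some external grader. All function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(prediction: str, reference: str) -> float:
    """Bag-of-words cosine similarity (assumed stand-in for a semantic similarity model)."""
    va = Counter(prediction.lower().split())
    vb = Counter(reference.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def accuracy_rate(judgments: list[bool]) -> float:
    """Fraction of responses judged to satisfy the task requirement."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def dual_metric(
    predictions: list[str], references: list[str], judgments: list[bool]
) -> dict[str, float]:
    """Report both scores side by side instead of collapsing them into one number."""
    sims = [cosine_similarity(p, r) for p, r in zip(predictions, references)]
    return {
        "semantic_similarity": sum(sims) / len(sims) if sims else 0.0,
        "accuracy_rate": accuracy_rate(judgments),
    }
```

Keeping the two scores separate reflects the distinction the abstract draws: a response can closely match the reference wording while still failing the task requirement, and the reverse can also happen.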
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Text Readability and Simplification
