PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives
Zhaowei Zhang, Xiaobo Wang, Minghua Yi, Mengmeng Wang, Fengshuo Bai, Zilong Zheng, Yipeng Kang, Yaodong Yang

TL;DR
PoliCon is a new benchmark using European Parliament records to evaluate large language models' ability to draft political consensus resolutions, revealing their limitations and biases in complex decision-making scenarios.
Contribution
Introduces PoliCon, a comprehensive benchmark and evaluation framework for assessing LLMs' capacity to generate political consensus resolutions based on diverse parliamentary contexts.
Findings
State-of-the-art LLMs struggle with complex consensus tasks.
Models exhibit partisan biases and favor dominant parties.
PoliCon effectively reveals LLMs' strengths and weaknesses in political decision-making.
Abstract
Achieving political consensus is crucial yet challenging for the effective functioning of social governance. Although frontier AI systems, represented by large language models (LLMs), have developed rapidly in recent years, their capabilities in this area remain understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament over 13 years, ranging from 2009 to 2022, to evaluate the ability of LLMs to draft consensus resolutions based on divergent party positions under varying collective decision-making contexts and political requirements. Specifically, PoliCon incorporates four factors to build each task environment for finding different political consensus: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also…
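To make the four task-environment factors concrete, here is a minimal Python sketch of how one such environment could be represented, together with a seat-weighted passing check. It is an illustration only and not the authors' implementation: the names `TaskEnvironment`, `seat_share_in_favor`, and `passes`, the party labels, the seat counts, and the threshold-based passing rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class TaskEnvironment:
    """Hypothetical container for the four factors named in the abstract."""
    issue: str               # the specific political issue under debate
    goal: str                # the political goal the resolution should serve
    seats: Dict[str, int]    # participating parties and their seat counts
    threshold: float = 0.5   # assumed passing rule: seat share must exceed this


def seat_share_in_favor(env: TaskEnvironment, supporters: Set[str]) -> float:
    """Fraction of all seats held by the parties that support a draft."""
    total = sum(env.seats.values())
    in_favor = sum(n for party, n in env.seats.items() if party in supporters)
    return in_favor / total


def passes(env: TaskEnvironment, supporters: Set[str]) -> bool:
    """True if the supporting parties' seat share exceeds the threshold."""
    return seat_share_in_favor(env, supporters) > env.threshold


# Illustrative values only; these are not taken from the benchmark.
env = TaskEnvironment(
    issue="example issue",
    goal="example goal",
    seats={"A": 40, "B": 35, "C": 25},
)
print(passes(env, {"A", "C"}))
```

Raising `threshold` models a stricter voting rule under the same seat distribution, which is one way the same set of party positions could yield different consensus targets.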
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper tackles an important topic.
2. The results are presented clearly.
1. One of the major weaknesses of the paper is the reliability of the LLM-as-judge pipeline. LLM-as-judge has long been criticized for its limited generalizability, and that criticism applies to this paper as well. If I understand it correctly, the LLM-as-judge evaluation relies heavily on existing statements and voting data. However, it is unclear whether this pipeline can produce generalizable results for new statements. I believe this is the major issue of this paper. I'm happy to discuss with the authors.
1. First benchmark specifically designed to evaluate LLMs' political consensus-building capabilities across diverse objectives.
2. 2,225 high-quality parliamentary records with extensive cleaning and processing, integrating multiple sources.
3. Diverse task settings: 15 different configurations combining party numbers, voting mechanisms, and political goals, creating 28,620 distinct scenarios.
1. Using GPT-4o-mini as the evaluator could introduce circular biases when testing other LLMs.
1. The paper tackles a fresh and meaningful problem, and it builds a solid and realistic benchmark using real parliamentary data, making the evaluation credible and grounded.
2. The evaluation setup is thoughtfully designed and connects well with social choice theory.
3. The experiments are thorough and provide clear insights into where current models perform well and where they fail.
1. The evaluation still depends on another LLM as a judge, which could introduce hidden bias.
2. The dataset only covers European Parliament data, so it might not generalize to other regions or political systems.
3. There’s no human validation to confirm that the evaluation results truly match real consensus reasoning.
4. Using LLMs in both dataset creation and evaluation could lead to subtle data leakage.
Taxonomy
Topics: European Union Policy and Governance
