PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

Zhaowei Zhang; Xiaobo Wang; Minghua Yi; Mengmeng Wang; Fengshuo Bai; Zilong Zheng; Yipeng Kang; Yaodong Yang

arXiv:2505.19558·cs.CY·February 16, 2026

Open Access·1 Dataset·3 Reviews

TL;DR

PoliCon is a new benchmark using European Parliament records to evaluate large language models' ability to draft political consensus resolutions, revealing their limitations and biases in complex decision-making scenarios.

Contribution

Introduces PoliCon, a comprehensive benchmark and evaluation framework for assessing LLMs' capacity to generate political consensus resolutions based on diverse parliamentary contexts.

Findings

1. State-of-the-art LLMs struggle with complex consensus tasks.
2. Models exhibit partisan biases and favor dominant parties.
3. PoliCon effectively reveals LLMs' strengths and weaknesses in political decision-making.

Abstract

Achieving political consensus is crucial yet challenging for the effective functioning of social governance. Although frontier AI systems, represented by large language models (LLMs), have developed rapidly in recent years, their capabilities in this area remain understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament spanning 13 years, from 2009 to 2022, to evaluate the ability of LLMs to draft consensus resolutions based on divergent party positions under varying collective decision-making contexts and political requirements. Specifically, PoliCon incorporates four factors to build each task environment for finding different political consensus: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also…
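The four factors listed in the abstract can be pictured as one record per task environment. The sketch below is illustrative only and is not taken from the PoliCon code or dataset schema: the class name `TaskEnvironment`, its field names, the party labels, the goal label, and the threshold rule in `passes` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskEnvironment:
    """Hypothetical container for the four factors named in the abstract."""

    issue: str             # specific political issue
    goal: str              # political goal (label is illustrative)
    seats: Dict[str, int]  # participating parties mapped to seat counts

    def seat_shares(self) -> Dict[str, float]:
        """Power structure expressed as each party's fraction of all seats."""
        total = sum(self.seats.values())
        if total <= 0:
            raise ValueError("seat counts must sum to a positive number")
        return {party: n / total for party, n in self.seats.items()}

    def passes(self, supporters: Iterable[str], threshold: float = 0.5) -> bool:
        """Assumed rule: the supporting seat share must exceed `threshold`."""
        total = sum(self.seats.values())
        if total <= 0:
            raise ValueError("seat counts must sum to a positive number")
        supporting = sum(self.seats[party] for party in set(supporters))
        return supporting / total > threshold


# Made-up example values, not real parliamentary data.
env = TaskEnvironment(
    issue="example issue",
    goal="simple majority",
    seats={"Party A": 40, "Party B": 35, "Party C": 25},
)
shares = env.seat_shares()
coalition_passes = env.passes(["Party A", "Party B"])
```

Keeping the seat counts as integers and deriving shares on demand means the same record can be checked against different thresholds without storing redundant values.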

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 5

Strengths

1. This paper tackles an important topic.
2. The results are presented clearly.

Weaknesses

1. One of the major weaknesses of the paper is the reliability of the LLM-as-judge pipeline. LLM-as-judge has long been criticized for its generalizability, which applies to this paper as well. If I understand it correctly, the LLM-as-judge evaluation relies heavily on existing statements and voting data. However, it is unclear whether this pipeline could produce generalizable results for new statements. I believe this is the major issue of this paper. I'm happy to discuss this with the authors.

Reviewer 02·Rating 8·Confidence 3

Strengths

1. First benchmark specifically designed to evaluate LLMs' political consensus-building capabilities across diverse objectives
2. 2,225 high-quality parliamentary records with extensive cleaning and processing, integrating multiple sources
3. Diverse task settings: 15 different configurations combining party numbers, voting mechanisms, and political goals, creating 28,620 distinct scenarios

Weaknesses

1. Using GPT-4o-mini as the evaluator could introduce circular biases when testing other LLMs.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper tackles a fresh and meaningful problem and builds a solid, realistic benchmark using real parliamentary data, making the evaluation credible and grounded.
2. The evaluation setup is thoughtfully designed and connects well with social choice theory.
3. The experiments are thorough and provide clear insights into where current models perform well and where they fail.

Weaknesses

1. The evaluation still depends on another LLM as a judge, which could introduce hidden bias.
2. The dataset only covers European Parliament data, so it might not generalize to other regions or political systems.
3. There’s no human validation to confirm that the evaluation results truly match real consensus reasoning.
4. Using LLMs in both dataset creation and evaluation could lead to subtle data leakage.

Code & Models

Datasets

Yofuria/PoliCon
dataset· 88 dl

Taxonomy

Topics: European Union Policy and Governance