The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations
Pierre-Antoine Lequeu, Léo Labat, Laurène Cave, Gaël Lejeune, François Yvon, Benjamin Piwowarski

TL;DR
This paper introduces GDN-CC, a dataset and framework for standardizing and analyzing citizen contributions in democratic consultations using small, open-weight language models for annotation and clarification.
Contribution
The paper presents GDN-CC, a new manually-curated dataset and a preprocessing framework for transforming noisy consultation data into structured argumentative units.
Findings
Fine-tuned small language models match or outperform larger LLMs on annotation tasks.
The GDN-CC-large dataset contains 240k automatically annotated contributions.
The framework improves the usability of democratic consultation data for analysis.
Abstract
LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised about their use as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the pragmatic level, and make them easier to use in topic modeling and political analysis; (b) to study how reliably this standardization can be performed by small, open-weight LLMs, i.e., models that can be run locally and transparently with limited resources. Accordingly, we introduce Corpus Clarification as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for…
