TL;DR
This paper presents ParlaCAP, a large multilingual dataset of European parliamentary speeches, and a scalable LLM-based method for classifying policy topics, enabling comparative political analysis.
Contribution
It introduces a new dataset and a cost-effective teacher-student method: an LLM annotates in-domain training data, and a multilingual encoder model fine-tuned on those annotations scales policy topic annotation across languages.
Findings
Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans.
The classifier outperforms existing out-of-domain CAP classifiers.
The dataset enables analysis of political attention, sentiment, and gender differences.
Abstract
This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but…
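The agreement comparison described in the abstract can be illustrated with a small sketch. The abstract does not name the agreement metric used, so Cohen's kappa is assumed here as one common choice, and the labels below are made-up toy data (using CAP-style topic names), not results from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (hypothetical, for illustration only).
human_1 = ["Health", "Education", "Health", "Defense", "Education", "Health"]
human_2 = ["Health", "Education", "Defense", "Defense", "Education", "Health"]
llm = ["Health", "Education", "Health", "Defense", "Health", "Health"]

human_human = cohen_kappa(human_1, human_2)
llm_human = cohen_kappa(llm, human_1)
print(f"human vs human: {human_human:.3f}")
print(f"LLM vs human:   {llm_human:.3f}")
```

If the LLM-human score falls in the same range as the human-human score, the LLM can be treated as roughly interchangeable with a human annotator for producing training labels, which is the premise of the teacher-student setup.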
