Political corpus creation through automatic speech recognition on EU debates
Hugo de Vos, Suzan Verberne

TL;DR
This paper develops an accurate automatic speech recognition pipeline for EU parliamentary debates, creating a large transcribed corpus to facilitate political research and analysis.
Contribution
It introduces a domain-specific ASR model using unsupervised adaptation of Wav2vec2.0, significantly improving transcription accuracy for EU debate recordings.
Findings
Domain-specific acoustic and language models reduce the word error rate (WER) from 28.22% to 17.95%.
Adding domain-specific terms did not improve ASR performance.
The corpus enables effective topic modeling for political analysis.
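To make the WER figures above concrete, the following is a minimal sketch of how word error rate is conventionally computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the authors' evaluation code. The function names and the example sentences are hypothetical, and the whitespace tokenization is a simplifying assumption. The second helper shows the arithmetic behind the reported drop from 28.22% to 17.95%, which is 10.27 percentage points absolute, or roughly 36% relative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are split on whitespace; no text normalization is done.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical example: one substituted word out of four reference words.
    print(word_error_rate("the committee meets today", "the committee meet today"))
    # WER values as reported in the findings above.
    print(f"{relative_reduction(28.22, 17.95):.1%}")
```

Because WER is normalized by the reference length, it can exceed 100% when the hypothesis contains many insertions.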
Abstract
In this paper, we present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words. The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings together with limited metadata. The meetings are in English, partly spoken by non-native speakers and partly spoken by interpreters. We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis. We focused on the unsupervised domain adaptation of the ASR pipeline. Building on the transformer-based Wav2vec2.0 model, we experimented with multiple acoustic models, language models and the addition…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
