Political corpus creation through automatic speech recognition on EU debates

Hugo de Vos; Suzan Verberne

arXiv:2304.08137·cs.CL·April 18, 2023·1 cites

TL;DR

This paper develops an accurate automatic speech recognition pipeline for EU parliamentary debates, creating a large transcribed corpus to facilitate political research and analysis.

Contribution

It introduces a domain-specific ASR model using unsupervised adaptation of Wav2vec2.0, significantly improving transcription accuracy for EU debate recordings.

Findings

01. Domain-specific acoustic and language models reduce the word error rate (WER) from 28.22% to 17.95%.

02. Adding domain-specific terms did not improve ASR performance.

03. The corpus enables effective topic modeling for political analysis.
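To make the WER figures above concrete, here is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function names and the example sentences are illustrative assumptions, not taken from the paper or its repository, and the paper's own evaluation may apply text normalization that this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not from the paper): whitespace tokenization,
    no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Hypothetical sentences: one substituted word out of five.
example_wer = word_error_rate(
    "the committee adopted the report",
    "the committee adopted a report",
)

# WER values reported in the findings, in percent.
reduction = relative_reduction(28.22, 17.95)
```

Applied to the reported numbers, the drop from 28.22% to 17.95% is an absolute reduction of 10.27 percentage points, or roughly 36% relative to the baseline.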

Abstract

In this paper, we present a transcribed corpus of the LIBE committee of the EU parliament, totalling 3.6 million running words. The meetings of parliamentary committees of the EU are a potentially valuable source of information for political scientists, but the data is not readily available because it is only disclosed as speech recordings together with limited metadata. The meetings are in English, partly spoken by non-native speakers, and partly spoken by interpreters. We investigated the most appropriate Automatic Speech Recognition (ASR) model to create an accurate text transcription of the audio recordings of the meetings in order to make their content available for research and analysis. We focused on the unsupervised domain adaptation of the ASR pipeline. Building on the transformer-based Wav2vec2.0 model, we experimented with multiple acoustic models, language models and the addition…

Code & Models

Repositories

hdvos/euparliamentasrdataandcode
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques