EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting
Maria Kunilovskaya, Christina Pollkläsener

TL;DR
This paper presents an updated bidirectional English-German corpus of European Parliament speeches with enhanced annotations. The corpus supports information-theoretic research on translation, interpreting, and language variation, and the paper demonstrates its utility through a filler prediction study.
Contribution
It introduces a refined, multi-layered corpus that combines spoken and written European Parliament data and adds new annotation layers, including word alignment and word-level surprisal, to support research on translation and interpreting.
Findings
Validated the integrity of the spoken data after updates
Evaluated GPT-2 and MT models for filler prediction in interpreting
Demonstrated the corpus's utility for language variation studies
Abstract
This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora containing original European Parliament speeches as well as their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies comparing written and spoken modes, and examining disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based…
Taxonomy
Topics: Natural Language Processing Techniques · Interpreting and Communication in Healthcare · Text Readability and Simplification
