MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Dávid Javorský, Ondřej Bojar, François Yvon

TL;DR
MockConf is a new dataset and toolset for analyzing simultaneous interpreting, featuring multilingual recordings, alignments, and evaluation metrics to advance automatic interpretation analysis.
Contribution
The paper introduces MockConf, a novel multilingual dataset with span- and word-level alignments, and InterAlign, a web-based annotation tool for interpreting research.
Findings
Dataset includes 7 hours of multilingual interpreting recordings.
Introduces InterAlign, a tool for detailed annotation of interpreting data.
Provides baseline alignment metrics and evaluation methods.
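As an illustration of what a word-level alignment metric can look like, here is a minimal sketch that scores predicted alignment links against gold links using precision, recall, and F1. This is an assumption for illustration only: the summary does not say which metrics the paper reports, and the `Link` representation, the function name `alignment_prf`, and the toy data are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: one link = (source word index, target word index).
Link = Tuple[int, int]


def alignment_prf(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against gold links."""
    if not predicted or not gold:
        # Scores are undefined with an empty side; report zeros by convention.
        return 0.0, 0.0, 0.0
    correct = len(predicted & gold)  # links present in both sets
    precision = correct / len(predicted)
    recall = correct / len(gold)
    if correct == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example (invented data, not taken from MockConf).
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 2), (2, 2)}
p, r, f = alignment_prf(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Treating alignments as sets of index pairs keeps the scoring independent of how the links were produced, so the same function could in principle compare any automatic aligner against manual annotations.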
Abstract
In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of…
Taxonomy
Topics: Text Readability and Simplification · Interpreting and Communication in Healthcare · Natural Language Processing Techniques
