Is 42 the Answer to Everything in Subtitling-oriented Speech Translation?
Alina Karakanta, Matteo Negri, Marco Turchi

TL;DR
This paper investigates speech translation methods for subtitling. It argues that conforming to a length constraint alone is not sufficient, and compares a direct end-to-end approach with a classical cascade approach for improving subtitle timing and segmentation.
Contribution
The paper compares a direct end-to-end and a classical cascade speech translation approach for subtitling, and highlights that access to the source audio matters more than length conformity alone.
Findings
Source speech information improves subtitle timing and segmentation.
Length alone is insufficient for effective subtitling-oriented speech translation.
End-to-end and cascade methods have different advantages in subtitling contexts.
Abstract
Subtitling is becoming increasingly important for disseminating information, given the enormous amounts of audiovisual content becoming available daily. Although Neural Machine Translation (NMT) can speed up the process of translating audiovisual content, large manual effort is still required for transcribing the source language, and for spotting and segmenting the text into proper subtitles. Creating proper subtitles in terms of timing and segmentation highly depends on information present in the audio (utterance duration, natural pauses). In this work, we explore two methods for applying Speech Translation (ST) to subtitling: a) a direct end-to-end and b) a classical cascade approach. We discuss the benefit of having access to the source language speech for improving the conformity of the generated subtitles to the spatial and temporal subtitling constraints and show that length is…
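To make the distinction between spatial and temporal constraints concrete, here is a minimal sketch of a conformity check for a single subtitle. The limits used (42 characters per line, two lines per subtitle, 21 characters per second) and the names `Subtitle` and `conforms` are assumptions chosen for illustration; the text above does not state the thresholds or the evaluation procedure the authors use.

```python
from dataclasses import dataclass

# Assumed limits for illustration only; not taken from the text above.
MAX_CHARS_PER_LINE = 42       # spatial: characters per subtitle line
MAX_LINES = 2                 # spatial: lines per subtitle block
MAX_CHARS_PER_SECOND = 21.0   # temporal: reading speed


@dataclass
class Subtitle:
    lines: list[str]  # text of each line in the subtitle block
    start: float      # display start time in seconds
    end: float        # display end time in seconds


def conforms(sub: Subtitle) -> dict[str, bool]:
    """Check one subtitle against the assumed spatial and temporal limits."""
    duration = sub.end - sub.start
    n_chars = sum(len(line) for line in sub.lines)
    return {
        "line_length": all(len(line) <= MAX_CHARS_PER_LINE for line in sub.lines),
        "line_count": len(sub.lines) <= MAX_LINES,
        # Reading speed needs the timing, which comes from the audio.
        "reading_speed": duration > 0 and n_chars / duration <= MAX_CHARS_PER_SECOND,
    }


text = ["Subtitling is becoming", "increasingly important."]
too_fast = conforms(Subtitle(text, start=0.0, end=1.0))
readable = conforms(Subtitle(text, start=0.0, end=3.0))
```

Under these assumed limits, the same two lines pass the length check in both cases but fail the reading-speed check when displayed for only one second. This is one way to read the claim that length alone is insufficient: the temporal constraint depends on utterance duration, which is only available from the speech signal.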
