Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Peter Polák, Ondřej Bojar

TL;DR
This paper introduces an end-to-end speech translation method in which a single model both segments and translates long-form speech in real time. The segmentation requires no supervision, and the paper reports state-of-the-art results across multiple languages.
Contribution
It presents a new segmentation approach integrated into the translation model using ST CTC, enabling low-latency, unsupervised segmentation within the same architecture.
Findings
Achieves state-of-the-art translation quality
Operates without supervision or extra parameters
Works across diverse language pairs and domains
Abstract
Current simultaneous speech translation models can process audio only up to a few seconds long. Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, the segmentation into sentences is not available in the real world. Current speech segmentation approaches either offer poor segmentation quality or have to trade latency for quality. In this paper, we propose a novel segmentation approach for low-latency end-to-end speech translation. We leverage the existing speech translation encoder-decoder architecture with ST CTC and show that it can perform the segmentation task without supervision or additional parameters. To the best of our knowledge, our method is the first that allows actual end-to-end simultaneous speech translation, as the same model is used for translation and segmentation at the same time. On…
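The abstract does not spell out how the ST CTC output yields segment boundaries. As a rough illustration only, the sketch below assumes one plausible reading: the CTC head emits frame-level scores over target tokens, a greedy CTC collapse gives each emitted token an approximate frame position, and frames where a sentence-final token appears are proposed as segment boundaries. The blank id, the function names, and the punctuation-based rule are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple

BLANK = 0  # hypothetical id of the CTC blank symbol


def greedy_ctc_alignment(frame_scores: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Greedy CTC collapse: return (token_id, first_frame) for each emitted token.

    A token is emitted at the first frame where the arg-max label is
    non-blank and differs from the previous frame's arg-max label.
    """
    aligned: List[Tuple[int, int]] = []
    prev = BLANK
    for t, scores in enumerate(frame_scores):
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != BLANK and best != prev:
            aligned.append((best, t))
        prev = best
    return aligned


def propose_boundaries(
    frame_scores: Sequence[Sequence[float]],
    sentence_end_ids: Set[int],
) -> List[int]:
    """Assumed rule: a segment boundary is any frame that emits a sentence-final token."""
    return [t for tok, t in greedy_ctc_alignment(frame_scores) if tok in sentence_end_ids]
```

With a toy vocabulary of blank, one word token, and one sentence-final token, `propose_boundaries` returns the frame indices at which the sentence-final token is first emitted. Because the rule reuses scores the translation model already produces, it adds no parameters, which is in the spirit of the claim in the abstract; the actual criterion used in the paper may differ.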
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
