Long-Form End-to-End Speech Translation via Latent Alignment Segmentation

Peter Polák; Ondřej Bojar

arXiv:2309.11384 · cs.CL · October 28, 2024 · 1 citation

Open Access

TL;DR

This paper introduces an end-to-end speech translation method in which a single model segments and translates speech in real time, without segmentation supervision, and reports state-of-the-art results across multiple languages.

Contribution

It presents a new segmentation approach integrated into the translation model using ST CTC, enabling low-latency, unsupervised segmentation within the same architecture.

Findings

01 Achieves state-of-the-art translation quality

02 Operates without supervision or extra parameters

03 Works across diverse language pairs and domains

Abstract

Current simultaneous speech translation models can process audio only up to a few seconds long. Contemporary datasets provide an oracle segmentation into sentences based on human-annotated transcripts and translations. However, segmentation into sentences is not available in the real world. Current speech segmentation approaches either offer poor segmentation quality or have to trade latency for quality. In this paper, we propose a novel segmentation approach for low-latency end-to-end speech translation. We leverage the existing speech translation encoder-decoder architecture with ST CTC and show that it can perform the segmentation task without supervision or additional parameters. To the best of our knowledge, our method is the first that allows actual end-to-end simultaneous speech translation, as the same model is used for translation and segmentation at the same time. On…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing