End-to-End Simultaneous Speech Translation with Differentiable Segmentation
Shaolei Zhang, Yang Feng

TL;DR
This paper introduces Differentiable Segmentation (DiSeg), a novel method for end-to-end simultaneous speech translation that learns optimal segmentation directly from the translation model, improving translation quality and segmentation accuracy.
Contribution
DiSeg enables joint training of segmentation and translation, overcoming limitations of fixed or external segmentation methods in SimulST.
Findings
DiSeg achieves state-of-the-art translation performance.
DiSeg demonstrates superior segmentation capability.
Joint training improves translation quality.
Abstract
End-to-end simultaneous speech translation (SimulST), also known as streaming speech translation, outputs the translation while still receiving the streaming speech input, and hence needs to segment the speech input and translate based on the speech received so far. However, segmenting the speech input at unfavorable moments can disrupt its acoustic integrity and adversely affect the performance of the translation model. Therefore, learning to segment the speech input at moments that help the translation model produce high-quality translation is the key to SimulST. Existing SimulST methods, using either fixed-length segmentation or an external segmentation model, always separate segmentation from the underlying translation model, and this gap yields segmentation outcomes that are not necessarily beneficial for the translation process. In this paper, we propose…
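To make the fixed-length baseline mentioned in the abstract concrete, here is a minimal sketch of segmenting a stream into equal-sized chunks and exposing the speech received so far after each boundary. This is not DiSeg and not the authors' code; the function names, the `segment_len` parameter, and the use of plain numbers as stand-ins for acoustic frames are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence


def fixed_length_segments(frames: Sequence[float], segment_len: int) -> Iterator[List[float]]:
    """Yield consecutive chunks of `segment_len` frames (hypothetical helper).

    Boundaries fall every `segment_len` frames regardless of content,
    so a chunk may end in the middle of a word. The last chunk may be shorter.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    for start in range(0, len(frames), segment_len):
        yield list(frames[start:start + segment_len])


def received_prefixes(frames: Sequence[float], segment_len: int) -> Iterator[List[float]]:
    """Yield the speech received so far after each segment boundary.

    A simultaneous translator would be invoked on each prefix in turn;
    the translation step itself is omitted here.
    """
    received: List[float] = []
    for segment in fixed_length_segments(frames, segment_len):
        received.extend(segment)
        yield list(received)


if __name__ == "__main__":
    stream = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # stand-in for acoustic frames
    for prefix in received_prefixes(stream, segment_len=3):
        print(len(prefix), "frames received")
```

Because the boundaries depend only on a frame count, nothing in this scheme lets the translation model influence where the input is cut, which is the gap the abstract describes.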
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
