End-to-End Simultaneous Speech Translation with Differentiable Segmentation

Shaolei Zhang; Yang Feng

arXiv:2305.16093 · cs.CL · November 13, 2023 · 1 citation

TL;DR

This paper introduces Differentiable Segmentation (DiSeg), a method for end-to-end simultaneous speech translation that learns where to segment the speech input directly from the underlying translation model, improving both translation quality and segmentation accuracy.

Contribution

DiSeg enables joint training of segmentation and translation, overcoming limitations of fixed or external segmentation methods in SimulST.

Findings

01 DiSeg achieves state-of-the-art translation performance.

02 DiSeg demonstrates superior segmentation capability.

03 Joint training improves translation quality.

Abstract

End-to-end simultaneous speech translation (SimulST), also known as streaming speech translation, outputs a translation while still receiving the streaming speech input. It therefore needs to segment the speech input and translate based on the speech received so far. However, segmenting the speech input at unfavorable moments can disrupt acoustic integrity and adversely affect the performance of the translation model. Learning to segment the speech input at moments that help the translation model produce high-quality translation is therefore the key to SimulST. Existing SimulST methods, which use either fixed-length segmentation or an external segmentation model, always separate segmentation from the underlying translation model, and this gap results in segmentation outcomes that are not necessarily beneficial for the translation process. In this paper, we propose…

Code & Models

Repositories

ictnlp/diseg
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing