Learning When to Translate for Streaming Speech

Qianqian Dong; Yaoming Zhu; Mingxuan Wang; Lei Li

arXiv:2109.07368 · cs.CL · March 23, 2022 · 1 citation

TL;DR

This paper introduces MoSST, a monotonic segmentation method for streaming speech translation that improves the timing of partial translations by detecting speech unit boundaries, leading to better quality-latency trade-offs.

Contribution

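MoSST adds a monotonic segmentation module inside an encoder-decoder speech translation model. The module accumulates acoustic information incrementally and detects speech unit boundaries, so that partial translations of streaming input are generated at those boundaries rather than at fixed intervals.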

Findings

01 MoSST outperforms existing streaming speech translation methods.

02 MoSST achieves the best trade-off between translation quality (BLEU) and latency among the compared methods.

03 These results hold across multiple translation directions of the MuST-C dataset.

Abstract

How can a system find the proper moments to generate partial sentence translations given streaming speech input? Existing approaches that wait and translate for a fixed duration often break the acoustic units in speech, since the boundaries between acoustic units are not evenly spaced. In this paper, we propose MoSST, a simple yet effective method for translating streaming speech content. Given a usually long speech sequence, we develop an efficient monotonic segmentation module inside an encoder-decoder model to accumulate acoustic information incrementally and detect proper speech unit boundaries for the input in the speech translation task. Experiments on multiple translation directions of the MuST-C dataset show that MoSST outperforms existing methods and achieves the best trade-off between translation quality (BLEU) and latency. Our code is available at https://github.com/dqqcasia/mosst.
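The contrast the abstract draws between fixed-duration waiting and boundary detection can be illustrated with a toy sketch. This is not the MoSST implementation: the per-frame `weights`, the `threshold` rule, and both function names are hypothetical stand-ins, and the abstract does not state how the module decides where a boundary falls. The sketch only assumes that some score is accumulated frame by frame and that a boundary is declared once enough has accumulated.

```python
from typing import Iterable, Iterator, List


def fixed_segments(num_frames: int, chunk: int) -> List[List[int]]:
    """Baseline: cut every `chunk` frames, regardless of content."""
    return [list(range(start, min(start + chunk, num_frames)))
            for start in range(0, num_frames, chunk)]


def monotonic_segments(weights: Iterable[float],
                       threshold: float = 1.0) -> Iterator[List[int]]:
    """Accumulate hypothetical per-frame weights left to right and emit a
    segment each time the running sum reaches `threshold` (an assumed rule).
    """
    acc = 0.0
    current: List[int] = []
    for i, w in enumerate(weights):
        current.append(i)
        acc += w
        if acc >= threshold:
            # Boundary detected: hand the segment downstream, then reset.
            yield current
            current, acc = [], 0.0
    if current:
        # Trailing frames that never reached the threshold.
        yield current


if __name__ == "__main__":
    # Made-up weights, chosen to be exact in binary floating point.
    weights = [0.25, 0.25, 0.5, 0.125, 0.875, 0.5]
    print(fixed_segments(len(weights), 2))     # [[0, 1], [2, 3], [4, 5]]
    print(list(monotonic_segments(weights)))   # [[0, 1, 2], [3, 4], [5]]
```

With these made-up weights, the fixed-size baseline cuts after every second frame, while the accumulate-and-threshold rule produces segments of uneven length that follow the weights. Because the generator only ever moves forward through the input, it can run on a stream without seeing the full utterance.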

Code & Models

Repositories

dqqcasia/mosst
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling