CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation

Rui Zhao; Jinyu Li; Ruchao Fan; Matt Post

arXiv:2410.05146 · cs.CL · October 8, 2024

TL;DR

This paper introduces CTC-GMM, a novel method that leverages machine translation data and CTC-based modality matching to improve the speed and accuracy of streaming speech translation models.

Contribution

The paper presents a new CTC-guided modality matching technique that effectively utilizes MT text data to enhance streaming speech translation performance.

Findings

01 Achieves up to a 13.9% relative increase in translation accuracy

02 Boosts decoding speed by 59.7% on GPU

03 Makes effective use of MT data for streaming speech translation

Abstract

Models for streaming speech translation (ST) can achieve high accuracy and low latency if they're developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this paper, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched {source-target} language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can…
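The compression step described in the abstract can be illustrated with a minimal sketch. It assumes a common form of CTC-guided compression: consecutive frames that share the same greedy CTC label are averaged into one embedding, and blank runs are dropped. The abstract does not specify the exact merging rule, so the averaging, the blank index of 0, and the function name `ctc_compress` are assumptions for illustration, not the authors' implementation.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token


def ctc_compress(frames, frame_labels, blank=BLANK):
    """Merge consecutive frames sharing a CTC label and drop blank runs.

    frames: list of equal-length embedding vectors, one per speech frame.
    frame_labels: per-frame greedy (argmax) CTC labels, same length as frames.
    Returns (compressed_embeddings, tokens), one entry per non-blank run.
    """
    if len(frames) != len(frame_labels):
        raise ValueError("frames and frame_labels must have the same length")
    compressed, tokens = [], []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segment = frames[start:start + length]
        start += length
        if label == blank:
            continue  # blank frames carry no token, so they are discarded
        dim = len(segment[0])
        # Averaging is an assumed merging rule; other pooling choices are possible.
        mean = [sum(vec[d] for vec in segment) / length for d in range(dim)]
        compressed.append(mean)
        tokens.append(label)
    return compressed, tokens


# Toy input: six frames of 2-dimensional embeddings with made-up labels.
frames = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [4.0, 4.0], [6.0, 8.0]]
labels = [0, 7, 7, 0, 5, 5]
embeddings, tokens = ctc_compress(frames, labels)
```

In this toy input, six frames shrink to two embeddings, one per predicted token. A sequence of that length can be aligned position by position with a text-token embedding sequence, which is the property the abstract relies on when it substitutes MT text pairs for speech input.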

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
