Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

Niko Moritz; Ruiming Xie; Yashesh Gaur; Ke Li; Simone Merello; Zeeshan Ahmed; Frank Seide; Christian Fuegen

arXiv:2412.15415 · eess.AS · December 23, 2024

TL;DR

This paper introduces JSTAR, a joint speech recognition and translation model using a fast-slow encoder architecture, achieving high-quality real-time results in conversational settings with smart-glasses.

Contribution

The paper presents a novel transducer-based joint model for simultaneous speech recognition and translation, including training strategies and application to bilingual conversational speech.

Findings

01. JSTAR outperforms cascaded models in BLEU scores

02. JSTAR achieves lower latency in speech translation

03. Training a transducer-based machine translation model improves results

Abstract

We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is transducer-based and uses a multi-objective training strategy that optimizes both ASR and ST objectives simultaneously. This allows JSTAR to produce high-quality streaming ASR and ST results. We apply JSTAR in a bilingual conversational speech setting with smart-glasses, where the model is also trained to distinguish speech from different directions corresponding to the wearer and a conversational partner. Different model pre-training strategies are studied to further improve results, including training of a transducer-based streaming machine translation (MT) model for the first time and applying it for parameter initialization of JSTAR. We demonstrate…
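
The multi-objective idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the ASR and ST losses are combined as a simple weighted sum, and that the fast encoder output feeds the ASR branch while the slow encoder, stacked on top of it, feeds the ST branch. The names `JointLossConfig`, `asr_weight`, `joint_loss`, and `fast_slow_forward` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class JointLossConfig:
    # Hypothetical interpolation weight between the two objectives;
    # the value is an assumption, not taken from the paper.
    asr_weight: float = 0.5


def joint_loss(asr_loss: float, st_loss: float, cfg: JointLossConfig) -> float:
    """Combine ASR and ST losses into one objective (assumed weighted sum)."""
    if not 0.0 <= cfg.asr_weight <= 1.0:
        raise ValueError("asr_weight must lie in [0, 1]")
    return cfg.asr_weight * asr_loss + (1.0 - cfg.asr_weight) * st_loss


def fast_slow_forward(frames, fast_encoder, slow_encoder):
    """Run cascaded encoders: the slow encoder consumes the fast encoder's output.

    Assumption: fast_out would feed an ASR decoder and slow_out an ST decoder.
    """
    fast_out = fast_encoder(frames)
    slow_out = slow_encoder(fast_out)
    return fast_out, slow_out
```

With toy callables standing in for the encoders, `fast_slow_forward` returns both intermediate representations, so each branch can compute its own loss before `joint_loss` merges them into a single value for the optimizer.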

Taxonomy

Topics: Natural Language Processing Techniques