Streaming Models for Joint Speech Recognition and Translation
Orion Weller, Matthias Sperber, Christian Gollan, Joris Kluivers

TL;DR
This paper presents a streaming end-to-end speech translation model that produces both transcripts and translations simultaneously, matching the performance of cascaded systems with fewer parameters and low latency.
Contribution
It introduces a re-translation-based streaming model with an interleaved inference method, which lets a single decoder output both the transcript and the translation instead of using a separate decoder for each.
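As a rough illustration of the two ideas named above, the sketch below shows a generic re-translation loop (the whole audio prefix is re-decoded each time a new chunk arrives, and the displayed output is replaced) together with one possible way of splitting a single interleaved token sequence into a transcript and a translation. This is a minimal sketch under stated assumptions, not the paper's implementation: the `<src>`/`<tgt>` tags, the function names, and the `toy_decode` stand-in for a model are all hypothetical, and the summary here does not specify how the paper actually interleaves its outputs.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical tags marking which output stream the following tokens belong to.
SRC_TAG = "<src>"
TGT_TAG = "<tgt>"


def split_interleaved(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Split one interleaved token sequence into (transcript, translation)."""
    transcript: List[str] = []
    translation: List[str] = []
    current = transcript  # tokens before any tag are treated as transcript
    for tok in tokens:
        if tok == SRC_TAG:
            current = transcript
        elif tok == TGT_TAG:
            current = translation
        else:
            current.append(tok)
    return transcript, translation


def retranslate_stream(
    chunks: Iterable[List[str]],
    decode: Callable[[List[str]], List[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Re-decode the full audio prefix after every chunk.

    Each yielded pair replaces whatever was previously displayed.
    """
    prefix: List[str] = []
    for chunk in chunks:
        prefix.extend(chunk)
        yield split_interleaved(decode(prefix))


def toy_decode(prefix: List[str]) -> List[str]:
    """Stand-in for a model: 'transcribes' each frame as itself and
    'translates' it by upper-casing. Purely illustrative."""
    out: List[str] = []
    for frame in prefix:
        out += [SRC_TAG, frame, TGT_TAG, frame.upper()]
    return out


if __name__ == "__main__":
    audio_chunks = [["hallo"], ["welt"]]  # strings standing in for audio frames
    for transcript, translation in retranslate_stream(audio_chunks, toy_decode):
        print(" ".join(transcript), "|", " ".join(translation))
```

Because every update is decoded from scratch, a real system built this way can revise earlier words as more audio arrives. The toy decoder never changes its earlier output, so that behavior is not exercised here.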
Findings
End-to-end streaming models achieve similar accuracy to cascaded systems.
The end-to-end models use half the parameters of the cascaded systems.
Both the end-to-end and cascaded systems maintain high translation quality at low latency (under one second).
Abstract
Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap, recent work has shown initial progress into the feasibility for end-to-end models to produce both of these outputs. However, all previous work has only looked at this problem from the consecutive perspective, leaving uncertainty on whether these approaches are effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches. We also…
