StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection

Sara Papi; Marco Gaido; Matteo Negri; Luisa Bentivogli

arXiv:2406.06097 · cs.SD · June 11, 2024

TL;DR

StreamAtt is a policy for direct streaming speech-to-text translation (StreamST) that uses attention to select which part of the audio history to retain, so that continuous, unbounded audio streams can be translated as they arrive. It is evaluated on all 8 languages of MuST-C v1.0.

Contribution

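The paper introduces StreamAtt, the first StreamST policy, and StreamLAAL, the first StreamST latency metric, which is designed to be comparable with existing SimulST metrics. Together they address the lack of research on streaming translation, with effectiveness shown across all 8 languages of MuST-C v1.0.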

Findings

01

StreamAtt outperforms naive baselines in experiments.

02

StreamLAAL is the first latency metric for StreamST and is designed to be comparable with existing SimulST metrics.

03

Streaming translation is shown to be effective across all 8 languages of MuST-C v1.0.

Abstract

Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt…
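To make the history-retention problem in the abstract concrete, here is a minimal sketch of a streaming loop that caps the audio history at a fixed number of samples. This is a naive fixed-window policy written only for illustration; it is not StreamAtt, which per the title selects the history to keep using attention. The function name `stream_with_bounded_history` and the parameter `max_history` are hypothetical, and audio is represented as plain lists of floats.

```python
from collections import deque
from typing import Iterable, Iterator, List


def stream_with_bounded_history(
    chunks: Iterable[List[float]], max_history: int
) -> Iterator[List[float]]:
    """Yield the audio history visible to a model after each incoming chunk.

    Illustrative assumption: a naive policy that keeps only the most recent
    `max_history` samples, so memory stays bounded on an unbounded stream.
    """
    # A deque with maxlen silently drops the oldest samples once it is full.
    history: deque = deque(maxlen=max_history)
    for chunk in chunks:
        history.extend(chunk)
        yield list(history)


if __name__ == "__main__":
    incoming = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    for step, visible in enumerate(stream_with_bounded_history(incoming, 3)):
        print(step, visible)
```

A fixed window like this can cut the audio mid-word and discards context regardless of what has already been translated, which is the kind of decision a dedicated StreamST policy is meant to make more carefully.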

Code & Models

Repositories

hlt-mt/fbk-fairseq
PyTorch · Official

Videos

StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection · Underline

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing