StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli

TL;DR
StreamAtt is a direct streaming speech-to-text translation (StreamST) policy that uses attention to select which part of the audio history to keep, so that continuous, unbounded audio streams can be translated in real time. It is evaluated on all 8 languages of MuST-C v1.0.
Contribution
The paper presents StreamAtt, the first StreamST policy, and StreamLAAL, the first StreamST latency metric, filling a gap left by prior work that focused solely on SimulST. It demonstrates the approach on all 8 languages of MuST-C v1.0.
Findings
StreamAtt outperforms naive baselines in experiments.
StreamLAAL is the first latency metric for StreamST and is designed to be comparable with existing SimulST metrics.
Experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt.
Abstract
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt…
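The abstract does not spell out how the audio history is selected, so the following is only a minimal sketch of the general idea suggested by the title: use attention weights between generated text and audio frames to decide how much past audio to keep. The function name `select_audio_history`, the `num_tokens_to_keep` parameter, and the argmax-based alignment rule are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def select_audio_history(
    attention: Sequence[Sequence[float]],
    num_tokens_to_keep: int,
) -> int:
    """Return the index of the first audio frame to retain.

    attention[i][j] is a hypothetical attention weight between text
    token i and audio frame j. Only the last `num_tokens_to_keep` tokens
    are kept as text history; audio before the earliest frame that any
    of them attends to most strongly is discarded.
    """
    if not attention:
        return 0
    if num_tokens_to_keep <= 0:
        # No text history is kept, so no past audio needs to be kept either.
        return len(attention[0])
    kept_rows = attention[-num_tokens_to_keep:]
    # For each retained token, find its most attended audio frame.
    most_attended = [
        max(range(len(row)), key=row.__getitem__) for row in kept_rows
    ]
    return min(most_attended)


# Toy example: 4 text tokens attending over 6 audio frames.
attention = [
    [0.7, 0.2, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.6, 0.2, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.6, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.2, 0.3, 0.4],
]
first_kept = select_audio_history(attention, num_tokens_to_keep=2)
retained_frames = list(range(first_kept, len(attention[0])))
```

In this toy case the two retained tokens attend most to frames 3 and 5, so frames 0 to 2 would be dropped and the audio history would start at frame 3. A real system would work on model-produced attention tensors and would also need a rule for choosing which text tokens to retain.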
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
