Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
Chantal Amrhein, Barry Haddow

TL;DR
This paper shows that fixed-window audio segmentation can be surprisingly effective for speech-to-text translation under the right conditions, challenging the assumption that more complex segmentation methods are always necessary.
Contribution
The study systematically compares segmentation strategies in speech translation, highlighting the effectiveness of simple fixed-window segmentation in both offline and online scenarios.
Findings
Fixed-window segmentation performs well under the right conditions.
Robustness to segmentation errors improves translation quality.
Simple methods can outperform complex segmentation strategies.
Abstract
For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
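To make the idea of fixed-window segmentation concrete, the following is a minimal sketch of how continuous audio could be cut into equal-length windows. It is not the authors' implementation: the function name `fixed_window_segments`, the parameters `window_s` and `overlap_s`, and the example values (16 kHz audio, 20-second windows) are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def fixed_window_segments(
    num_samples: int,
    sample_rate: int,
    window_s: float,
    overlap_s: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of fixed-length windows.

    Hypothetical helper for illustration only. Every window has the
    same length except possibly the last, which is cut off at the end
    of the audio. `end` is exclusive.
    """
    window = int(round(window_s * sample_rate))
    hop = window - int(round(overlap_s * sample_rate))
    if window <= 0 or hop <= 0:
        raise ValueError("window must be positive and longer than the overlap")

    segments: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + window, num_samples)
        segments.append((start, end))
        if end == num_samples:
            break  # the audio is fully covered
        start += hop
    return segments


# Example with assumed values: 65 s of 16 kHz audio, 20 s windows.
segments = fixed_window_segments(65 * 16000, 16000, window_s=20.0)
print(len(segments))   # 4
print(segments[-1])    # (960000, 1040000)
```

Each window would then be passed to the translation model independently. Because the cut points ignore sentence and pause boundaries, the model has to tolerate segments that start or end mid-utterance, which is why robustness to segmentation errors matters in this setting.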
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech Recognition and Synthesis
