Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation

Chantal Amrhein; Barry Haddow

arXiv:2210.13363·cs.CL·October 25, 2022·1 cites

TL;DR

This paper demonstrates that fixed-window audio segmentation can be surprisingly effective for speech-to-text translation, especially when combined with appropriate strategies, challenging the assumption that more complex segmentation methods are always necessary.

Contribution

The study systematically compares segmentation strategies in speech translation, highlighting the effectiveness of simple fixed-window segmentation in both offline and online scenarios.

Findings

01 Fixed-window segmentation performs well given the right conditions.

02 Making models more robust to segmentation errors improves translation quality.

03 Simple methods can outperform complex segmentation strategies.

Abstract

For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
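
To make the idea of fixed-window audio segmentation concrete, here is a minimal sketch that cuts a recording of known duration into equal-length windows. It is not taken from the paper or its repository: the function name `fixed_window_segments`, its parameters, and the optional `overlap_seconds` setting are all assumptions made for illustration.

```python
from typing import List, Tuple


def fixed_window_segments(
    total_seconds: float,
    window_seconds: float,
    overlap_seconds: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times that cover the audio with fixed-length windows.

    Hypothetical helper: boundaries depend only on the clock, never on the
    audio content, which is what makes the segmentation "fixed-window".
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")

    # Each new window starts this far after the previous one.
    step = window_seconds - overlap_seconds

    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        # The last window is truncated at the end of the recording.
        end = min(start + window_seconds, total_seconds)
        segments.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return segments


if __name__ == "__main__":
    for seg in fixed_window_segments(65.0, 20.0):
        print(seg)
```

Each resulting segment would then be passed to the translation model on its own. Because the cut points ignore pauses and sentence boundaries, a window can start or end mid-sentence, which is why robustness to segmentation errors matters in this setting.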

Code & Models

Repositories

zurichnlp/window_audio_segmentation (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Speech Recognition and Synthesis