On the Impact of Noises in Crowd-Sourced Data for Speech Translation

Siqi Ouyang; Rong Ye; Lei Li

arXiv:2206.13756·cs.CL·July 4, 2022

TL;DR

This paper investigates how noise and quality issues in crowd-sourced speech translation datasets affect model training and evaluation, proposing automatic filtering methods to improve data quality and model performance.

Contribution

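It identifies three key quality issues in the MuST-C dataset (audio-text misalignment, inaccurate translation, and unnecessary speaker names) and introduces an automatic approach to fix or filter them before training speech translation models.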

Findings

01

Models perform better on cleaner test sets.

02

Removing misaligned data alone does not improve models.

03

Data quality significantly impacts model evaluation.

Abstract

Training speech translation (ST) models requires large and high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of the eight translation directions. This dataset passed several quality-control filters during creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translation, and unnecessary speaker names. What is the impact of these data quality issues on model development and evaluation? In this paper, we propose an automatic method to fix or filter the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and the rank of proposed models remains consistent across different test sets. Besides, simply removing…
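The abstract names the quality issues but the excerpt here does not describe the filtering method itself. As a rough illustration only, the sketch below shows what a rule-based pass over speech-transcript-translation triples could look like. Everything in it is an assumption made for illustration and not the paper's method: the `Triple` record, the `SPEAKER_PREFIX` pattern, and the characters-per-second thresholds are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: one to four capitalized tokens followed by a colon at
# the start of the text, e.g. "Chris Anderson: ...". Assumed for illustration.
SPEAKER_PREFIX = re.compile(r"^(?:[A-Z][\w.'-]*\s){0,3}[A-Z][\w.'-]*:\s+")


@dataclass
class Triple:
    """Hypothetical record for one speech-transcript-translation example."""
    duration_s: float
    transcript: str
    translation: str


def strip_speaker(text: str) -> str:
    """Remove a leading 'Name: ' prefix if one is present."""
    return SPEAKER_PREFIX.sub("", text, count=1)


def plausible_rate(t: Triple, min_cps: float = 2.0, max_cps: float = 30.0) -> bool:
    """Crude misalignment check: transcript characters per second of audio.

    The thresholds are illustrative guesses, not values from the paper.
    """
    if t.duration_s <= 0:
        return False
    cps = len(t.transcript) / t.duration_s
    return min_cps <= cps <= max_cps


def clean(triples):
    """Strip speaker prefixes, then keep only triples with a plausible rate."""
    kept = []
    for t in triples:
        t = Triple(t.duration_s, strip_speaker(t.transcript), strip_speaker(t.translation))
        if t.transcript and plausible_rate(t):
            kept.append(t)
    return kept
```

A heuristic like this is deliberately simple: the prefix pattern will also strip non-name prefixes such as "Note: ", and a speaking-rate check cannot detect inaccurate translations, which would need a separate signal.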

Code & Models

Repositories

owaski/must-c-clean (PyTorch, official)
