Revisiting Interpolation Augmentation for Speech-to-Text Generation
Chen Xu, Jie Wang, Xiaoqian Liu, Qianqian Dong, Chunliang Zhang, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang

TL;DR
This paper examines interpolation augmentation for speech-to-text systems and shows that, with an appropriate strategy, it improves performance across various models and datasets, especially in low-resource scenarios.
Contribution
It provides a comprehensive analysis of interpolation augmentation's effectiveness in S2T tasks, which was previously under-explored, and offers guidelines for its optimal application.
Findings
Interpolation augmentation significantly improves S2T performance.
Effectiveness is consistent across different architectures and data scales.
Proper strategy selection is crucial for maximizing benefits.
Abstract
Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique's application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings.
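The idea of "interpolating inputs and labels" can be illustrated with a minimal sketch in the style of generic mixup. This is not the paper's specific strategy, which the abstract does not detail. The helper names (`pad_to`, `interpolate_features`, `interpolate_loss`), the zero-padding of the shorter utterance, the Beta-distributed mixing weight, and the value of `alpha` are all assumptions made for illustration. Because target token sequences cannot be averaged directly, the sketch interpolates the two per-target losses instead, a common way to realize label interpolation for sequence outputs.

```python
import numpy as np


def pad_to(x, length):
    """Zero-pad a (frames, dims) feature matrix along time to `length` frames."""
    pad = length - x.shape[0]
    return np.pad(x, ((0, pad), (0, 0)))


def interpolate_features(x_a, x_b, lam):
    """Mix two utterances' features: lam * x_a + (1 - lam) * x_b.

    Assumption: the shorter utterance is zero-padded to the longer one.
    """
    length = max(x_a.shape[0], x_b.shape[0])
    return lam * pad_to(x_a, length) + (1.0 - lam) * pad_to(x_b, length)


def interpolate_loss(loss_a, loss_b, lam):
    """Mix the losses computed against each original transcript."""
    return lam * loss_a + (1.0 - lam) * loss_b


rng = np.random.default_rng(0)
alpha = 0.5                           # hypothetical Beta concentration
lam = rng.beta(alpha, alpha)          # mixing weight in [0, 1]

x_a = rng.standard_normal((120, 80))  # stand-in for 120 frames of 80-dim features
x_b = rng.standard_normal((95, 80))   # stand-in for a shorter second utterance
x_mix = interpolate_features(x_a, x_b, lam)
```

In a training loop under these assumptions, the model would be run once on `x_mix`, its loss computed against each of the two original transcripts, and the two values combined with `interpolate_loss` using the same `lam`.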
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
