On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems

Philippe Gonzalez; Tommy Sonne Alstrøm; Tobias May

arXiv:2301.10587·cs.SD·November 9, 2023

TL;DR

This paper systematically studies how different batching strategies and batch sizes affect training efficiency and speech enhancement performance in end-to-end neural systems. It finds that small batch sizes improve enhancement performance, while sorted or bucket batching with a dynamic batch size reduces training time and memory usage.

Contribution

The paper provides a systematic analysis of batching strategies for variable-length speech inputs and identifies strategies that make training more resource-efficient without degrading performance.

Findings

01. Small batch sizes improve speech enhancement performance.

02. Sorted or bucket batching with dynamic batch sizes reduces training time and memory usage.

03. Resource-efficient batching strategies maintain performance in both matched and mismatched conditions.
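
To make the second finding concrete, here is a minimal sketch of sorted batching with a dynamic batch size. It is not the authors' implementation: the function name `dynamic_sorted_batches` and the `max_padded_samples` budget are assumptions made for illustration. The idea is to sort inputs by length, then fill each batch until its zero-padded size (number of items times the longest item) would exceed the budget, so short inputs form larger batches and long inputs form smaller ones.

```python
import random


def dynamic_sorted_batches(lengths, max_padded_samples, seed=0):
    """Group indices into batches whose zero-padded size fits a budget.

    lengths: input lengths, e.g. in audio samples.
    max_padded_samples: hypothetical budget on (batch size * longest item).
    """
    # Sort indices by length so neighbouring items need little padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for i in order:
        # Lengths ascend, so item i would be the longest in the batch.
        if current and (len(current) + 1) * lengths[i] > max_padded_samples:
            batches.append(current)
            current = []
        # An item longer than the budget still gets a batch of its own.
        current.append(i)
    if current:
        batches.append(current)
    # Shuffle the batch order to recover some randomization across steps.
    random.Random(seed).shuffle(batches)
    return batches


lengths = [3, 10, 4, 9, 2, 8]  # toy lengths, not real data
batches = dynamic_sorted_batches(lengths, max_padded_samples=20)
```

With the toy lengths above, the three shortest inputs share one batch while the longest input ends up alone, which is what keeps the amount of data per batch roughly consistent.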

Abstract

The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive for a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these strategies on resource utilization and, more importantly, network performance is not well documented. This paper systematically investigates the effect of different batching strategies and batch sizes on the training…
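
The compromise between zero-padding and data randomization mentioned in the abstract can be illustrated with a small, hedged experiment. The sketch below measures the fraction of zeros in padded batches for a fully random ordering and for a length-sorted ordering at a fixed batch size. The lengths are made-up values (1 to 10 seconds at an assumed 16 kHz sampling rate), and the helper names `padding_fraction` and `fixed_size_batches` are hypothetical, not taken from the paper.

```python
import random


def padding_fraction(lengths, batches):
    """Fraction of elements in the zero-padded batches that are padding."""
    real = sum(lengths[i] for b in batches for i in b)
    # Each batch is padded to the length of its longest item.
    padded = sum(len(b) * max(lengths[i] for i in b) for b in batches)
    return 1 - real / padded


def fixed_size_batches(order, batch_size):
    """Split an index ordering into consecutive batches of fixed size."""
    return [order[k:k + batch_size] for k in range(0, len(order), batch_size)]


rng = random.Random(0)
# Made-up lengths in samples: 1 to 10 seconds at an assumed 16 kHz rate.
lengths = [rng.randint(16000, 160000) for _ in range(64)]

# Random batching: maximal randomization, but more padding.
random_order = list(range(len(lengths)))
rng.shuffle(random_order)

# Sorted batching: similar lengths share a batch, so less padding.
sorted_order = sorted(range(len(lengths)), key=lambda i: lengths[i])

frac_random = padding_fraction(lengths, fixed_size_batches(random_order, 8))
frac_sorted = padding_fraction(lengths, fixed_size_batches(sorted_order, 8))
print(f"random: {frac_random:.2%} padding, sorted: {frac_sorted:.2%} padding")
```

Sorting never increases the padding fraction for equally sized batches, but it fixes which inputs appear together, which is the loss of randomization the abstract refers to.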

Taxonomy

Topics: Speech and Audio Processing · Indoor and Outdoor Localization Technologies · Speech Recognition and Synthesis