TL;DR
This paper introduces WavAugment, a time-domain data augmentation library, and shows that augmenting the training audio substantially improves Contrastive Predictive Coding (CPC) speech representations while sharply reducing the amount of data needed.
Contribution
The paper presents WavAugment, a time-domain data augmentation library, and uses it to strengthen Contrastive Predictive Coding (CPC) for speech representations, beating the reference Libri-light results with 600 times less data.
Findings
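Applying augmentation to the past segments (the context CPC predicts from) is generally more efficient and performs better than the other augmentation strategies tested.
Combining pitch modification, additive noise, and reverberation gives CPC a relative improvement of 18-22%.
The augmented CPC beats the reference Libri-light results with 600 times less data and, when trained on an out-of-domain dataset, is on par with the state of the art on the Zero Speech Benchmark 2017.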
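To make the "past-only" idea concrete, here is a minimal sketch in plain NumPy. It does not use the WavAugment API; the function names (`add_noise_at_snr`, `augment_past_only`), the fixed split point, and the 10 dB SNR are all assumptions chosen for illustration, and the paper's actual segmentation of past and future windows may differ. It covers only additive noise, one of the three augmentations listed above.

```python
import numpy as np


def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal` so the mixture has the requested SNR in dB."""
    # Repeat or trim the noise so it matches the signal length.
    noise = np.resize(noise, signal.shape)
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that signal_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise


def augment_past_only(waveform, split, noise, snr_db=10.0):
    """Split at `split` samples; augment the past part, leave the future clean."""
    past, future = waveform[:split], waveform[split:]
    return add_noise_at_snr(past, noise, snr_db), future.copy()


# Hypothetical input: 1 second of a 220 Hz tone sampled at 16 kHz.
rng = np.random.default_rng(0)
waveform = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)

past, future = augment_past_only(waveform, split=12000, noise=noise, snr_db=10.0)
```

In this sketch the model would see a noisy context but still be asked to predict clean future samples, which is one way to read "augmentation in the past".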
Abstract
Contrastive Predictive Coding (CPC), which predicts future segments of speech from past segments, is emerging as a powerful algorithm for representation learning of speech signals. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library, and find that applying augmentation in the past is generally more efficient and yields better performance than other methods. We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (relative improvement of 18-22%), beating the reference Libri-light results with 600 times less data. Using an out-of-domain dataset, time-domain data augmentation can push CPC to be on par with the state of the art on the Zero Speech Benchmark 2017. We also show that time-domain data augmentation…
