TL;DR
This paper systematically examines how sampling strategies that account for word-frequency and speaker distributions affect the performance of Siamese networks in unsupervised speech representation learning, and shows that well-chosen strategies improve results.
Contribution
It highlights the importance of the sampling procedure in Siamese networks and demonstrates that strategies accounting for Zipf's Law, the speaker distribution, and the proportion of same and different word pairs significantly affect network performance.
Findings
Sampling strategies based on Zipf's Law improve performance.
Word frequency compression improves learning across a large range of training-pair counts.
Applying these strategies to unsupervised word pairs improves state-of-the-art results.
Abstract
Recent studies have investigated siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we investigate systematically an often ignored component of siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's Law, the distribution of speakers and the proportions of same and different pairs of words significantly impact the performance of the network. In particular, we show that word frequency compression improves learning across a large range of variations in number of training pairs. This effect does not apply to the same extent to the fully unsupervised setting, where the pairs of same-different words are obtained by spoken term discovery. We apply these results to pairs of words discovered using an…
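To make the idea of word frequency compression concrete, here is a minimal sketch of how "same" pairs might be sampled with compressed type frequencies. It is an illustration under stated assumptions, not the authors' implementation: the power-law form `count ** alpha`, the parameter name `alpha`, and the helper names `compressed_type_weights` and `sample_same_pairs` are all hypothetical. With `alpha = 1` word types are drawn in proportion to their token counts (Zipfian); with `alpha = 0` every type is equally likely.

```python
import random
from collections import Counter


def compressed_type_weights(tokens, alpha=0.5):
    """Return a sampling probability per word type, proportional to count ** alpha.

    alpha is an assumed compression exponent: 1.0 keeps the raw frequencies,
    0.0 flattens them to a uniform distribution over types.
    """
    counts = Counter(tokens)
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def sample_same_pairs(tokens, n_pairs, alpha=0.5, seed=0):
    """Sample (word, i, j) triples where tokens i and j share the same word type."""
    rng = random.Random(seed)
    # Index token positions by word type.
    positions = {}
    for i, w in enumerate(tokens):
        positions.setdefault(w, []).append(i)
    # Only types with at least two tokens can form a "same" pair.
    eligible = {w: idx for w, idx in positions.items() if len(idx) >= 2}
    types = sorted(eligible)
    weights = [len(eligible[w]) ** alpha for w in types]
    pairs = []
    for _ in range(n_pairs):
        # Pick a word type with compressed frequency, then two distinct tokens of it.
        w = rng.choices(types, weights=weights, k=1)[0]
        i, j = rng.sample(eligible[w], 2)
        pairs.append((w, i, j))
    return pairs


if __name__ == "__main__":
    corpus = ["the"] * 8 + ["speech"] * 3 + ["network"] * 2 + ["zipf"]
    print(compressed_type_weights(corpus, alpha=0.5))
    print(sample_same_pairs(corpus, n_pairs=5, alpha=0.5))
```

Lowering `alpha` shifts sampling mass from very frequent words toward rarer ones, which is the kind of compression the abstract refers to. "Different" pairs and speaker balancing would need analogous choices that this sketch does not cover.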
Taxonomy
Methods: Siamese Network
