Use of speaker recognition approaches for learning and evaluating embedding representations of musical instrument sounds
Xuan Shi, Erica Cooper, Junichi Yamagishi

TL;DR
This paper adapts speaker recognition techniques to learn and evaluate embedding spaces for musical instrument sounds, enabling recognition of unseen instruments for music synthesis applications.
Contribution
The paper repurposes Automatic Speaker Verification (ASV) architectures and evaluation methods to learn and evaluate musical instrument sound embeddings, and demonstrates their effectiveness on the NSynth and RWC datasets.
Findings
Unseen instruments are recognized effectively, as measured by equal error rate (EER)
Data augmentation and angular softmax improve embedding quality
Multi-task learning with instrument family labels enhances embedding structure
Abstract
Constructing an embedding space for musical instrument sounds that can meaningfully represent new and unseen instruments is important for downstream music generation tasks such as multi-instrument synthesis and timbre transfer. The framework of Automatic Speaker Verification (ASV) provides us with architectures and evaluation methodologies for verifying the identities of unseen speakers, and these can be repurposed for the task of learning and evaluating a musical instrument sound embedding space that can support unseen instruments. Borrowing from state-of-the-art ASV techniques, we construct a musical instrument recognition model that uses a SincNet front-end, a ResNet architecture, and an angular softmax objective function. Experiments on the NSynth and RWC datasets show our model's effectiveness in terms of equal error rate (EER) for unseen instruments, and ablation studies show the…
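As a rough illustration of the EER metric mentioned in the abstract, the sketch below scores verification trials by cosine similarity between embeddings and then sweeps a decision threshold until the false acceptance and false rejection rates meet. The function names `cosine_similarity` and `equal_error_rate`, the simple threshold sweep, and the toy scores are assumptions made for this example; they are not taken from the paper, whose exact scoring and evaluation code is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by using every observed score as a threshold.

    target_scores: similarity scores for trials where both sounds come
        from the same instrument.
    nontarget_scores: scores for trials with different instruments.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target trial")

    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance: a different-instrument trial scored above threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: a same-instrument trial scored below threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    same_instrument = [0.9, 0.6, 0.4, 0.8]
    different_instrument = [0.1, 0.5, 0.7, 0.3]
    print(equal_error_rate(same_instrument, different_instrument))  # 0.25
```

A lower EER means the embedding separates same-instrument pairs from different-instrument pairs more cleanly. Because the trials can be built from instruments that were never seen in training, the same procedure applies to the unseen-instrument setting the abstract describes.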
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Average Pooling · 1x1 Convolution · Residual Connection · Convolution · Batch Normalization · Global Average Pooling · Max Pooling · Bottleneck Residual Block · Kaiming Initialization
