Unsupervised Learning of Semantic Audio Representations

Aren Jansen; Manoj Plakal; Ratheet Pandya; Daniel P. W. Ellis; Shawn Hershey; Jiayang Liu; R. Channing Moore; Rif A. Saurous

arXiv:1711.02209 · cs.SD · November 8, 2017 · 5 cites

Open Access

TL;DR

This paper introduces an unsupervised approach to learning semantic audio representations. It uses class-agnostic constraints on unlabeled audio to train embeddings that are effective for sound retrieval and classification.

Contribution

It proposes a triplet loss-based training method for convolutional neural networks that exploits class-agnostic semantic constraints in unlabeled audio. The resulting embeddings recover 41% of supervised retrieval performance and 84% of supervised classification performance.
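
For orientation, a triplet loss in its common hinge form can be written as below. This is the standard textbook formulation, not necessarily the exact objective used in the paper: the embedding function $g$, the squared Euclidean distance, and the margin $\delta$ are assumptions, since this summary does not state them.

```latex
% Anchor x_a, positive x_p (same category under a constraint), negative x_n.
% g(.) is the embedding network; \delta > 0 is an assumed margin.
\mathcal{L} = \sum_{i=1}^{N} \max\!\left(0,\;
  \lVert g(x_a^{(i)}) - g(x_p^{(i)}) \rVert_2^2
  - \lVert g(x_a^{(i)}) - g(x_n^{(i)}) \rVert_2^2
  + \delta \right)
```

The loss is zero for a triplet once the negative is farther from the anchor than the positive by at least $\delta$, which is the kind of geometric inequality that can be specified without class labels.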

Findings

01

Achieves 41% of supervised retrieval performance

02

Achieves 84% of supervised classification performance

03

Doubles state-of-the-art classification performance in limited-supervision settings

Abstract

Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their…
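
The three constraints in the abstract can each be read as a recipe for building an (anchor, positive, negative) triplet from unlabeled audio. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Every function name, the circular time shift, the mixing weight, the margin, the spectrogram shape, and the linear `embed` stand-in for the CNN are hypothetical choices made for illustration.

```python
import numpy as np


def noise_positive(anchor, rng, sigma=0.1):
    # Constraint (i): added noise should not change the sound category.
    return anchor + sigma * rng.standard_normal(anchor.shape)


def time_shift_positive(anchor, shift):
    # Constraint (i): a translation in time (circular here, for simplicity).
    return np.roll(anchor, shift, axis=0)


def mixture_positive(anchor, other, weight=0.5):
    # Constraint (ii): a mixture inherits the categories of its constituents.
    return anchor + weight * other


def temporal_neighbor(recording, start, length, offset):
    # Constraint (iii): a nearby segment of the same recording is likely
    # to share the anchor's category or a related one.
    begin = start + offset
    return recording[begin:begin + length]


def triplet_loss(emb_a, emb_p, emb_n, margin=0.1):
    # Hinge on squared Euclidean distances (assumed form and margin).
    d_pos = np.sum((emb_a - emb_p) ** 2, axis=-1)
    d_neg = np.sum((emb_a - emb_n) ** 2, axis=-1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def embed(spec, weights):
    # Hypothetical stand-in for the CNN: flatten, then project linearly.
    return spec.reshape(-1) @ weights


rng = np.random.default_rng(0)
frames, bands, dim = 96, 64, 128           # assumed spectrogram/embedding sizes
recording = rng.random((4 * frames, bands))  # synthetic "recording"
anchor = recording[:frames]
negative = rng.random((frames, bands))       # random segment used as negative
weights = rng.standard_normal((frames * bands, dim)) / np.sqrt(frames * bands)

positives = {
    "noise": noise_positive(anchor, rng),
    "time_shift": time_shift_positive(anchor, shift=5),
    "mixture": mixture_positive(anchor, negative),
    "neighbor": temporal_neighbor(recording, 0, frames, offset=frames),
}

emb_a = embed(anchor, weights)
emb_n = embed(negative, weights)
losses = {
    name: triplet_loss(emb_a, embed(pos, weights), emb_n)
    for name, pos in positives.items()
}
```

In a real training loop the linear projection would be replaced by a trainable network and the loss minimized by gradient descent over many such triplets. Drawing negatives at random is only reasonable when two random clips are unlikely to share a category.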

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis