I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition

Yannis Vasilakis; Rachel Bittner; Johan Pauwels

arXiv:2407.18058 · cs.SD · July 26, 2024

TL;DR

This paper evaluates two-tower multimodal systems for zero-shot instrument recognition. It finds that the audio encoders are strong, while the text encoders and the joint space remain weak points, particularly in semantic comprehension.

Contribution

It provides a detailed analysis of the properties of audio-text embeddings in two-tower systems for instrument recognition and introduces a novel method to assess semantic meaningfulness using an instrument ontology.

Findings

01 Audio encoders perform well independently.

02 Text encoders and joint spaces show sensitivity to specific words.

03 Systems lack effective use of textual context for accurate instrument inference.

Abstract

Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embedding spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower…
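
To make the idea of zero-shot classification in a joint audio-text space concrete, here is a minimal sketch in Python. It is not the authors' code: the function names (`l2_normalize`, `zero_shot_classify`), the three-dimensional toy vectors, and the use of cosine similarity as the comparison are all assumptions made for illustration. In a real two-tower system the vectors would come from the audio and text encoders after projection into the joint space.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_classify(audio_emb, label_emb, labels):
    """Assign each audio embedding the label whose text embedding is most similar.

    audio_emb: (n_clips, dim) array, hypothetical joint-space audio embeddings.
    label_emb: (n_labels, dim) array, hypothetical joint-space text embeddings.
    labels:    list of n_labels label strings.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(label_emb, dtype=float))
    sims = a @ t.T                 # (n_clips, n_labels) cosine similarities
    best = sims.argmax(axis=1)     # index of the closest label per clip
    return [labels[i] for i in best], sims


# Toy stand-ins for encoder outputs (assumed values, not real embeddings).
labels = ["guitar", "piano", "violin"]
label_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([
    [0.9, 0.1, 0.0],   # closest to the "guitar" label vector
    [0.1, 0.2, 0.8],   # closest to the "violin" label vector
])

predicted, similarities = zero_shot_classify(audio_emb, label_emb, labels)
print(predicted)  # ['guitar', 'violin']
```

Because no classifier is trained on the labels, the prediction depends entirely on where the text encoder places each label in the joint space, which is why sensitivity to specific words in the text tower directly affects the result.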

Code & Models

Repositories

YannisBilly/i_can_listen_but_cannot_read
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Ontology