I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition
Yannis Vasilakis, Rachel Bittner, Johan Pauwels

TL;DR
This paper evaluates two-tower multimodal systems for zero-shot instrument recognition. It finds that the audio encoders are strong, while the text encoders and the joint audio-text space remain weak points, which makes semantic comprehension the main area for improvement.
Contribution
It provides a detailed analysis of the properties of audio-text embeddings in two-tower systems for instrument recognition and introduces a novel method to assess semantic meaningfulness using an instrument ontology.
Findings
Audio encoders perform well independently.
Text encoders and joint spaces show sensitivity to specific words.
Systems lack effective use of textual context for accurate instrument inference.
Abstract
Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embedding spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower…
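As an illustration of the zero-shot setup the abstract describes, the sketch below classifies one audio clip by comparing its embedding with the embeddings of candidate instrument labels in a shared space and picking the most similar label. It is a minimal sketch, not the authors' pipeline: the function names (`l2_normalize`, `zero_shot_classify`) are hypothetical, cosine similarity is an assumed choice of comparison, and the 3-dimensional vectors are made-up stand-ins for the outputs of real audio and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_embs, labels):
    """Return the label whose text embedding is closest to the audio embedding.

    audio_emb: shape (d,), embedding of one clip in the joint space (assumed).
    text_embs: shape (n_labels, d), one embedding per candidate label (assumed).
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_embs, dtype=float))
    sims = t @ a  # cosine similarity between the clip and every label
    return labels[int(np.argmax(sims))], sims


# Toy data: hypothetical embeddings, not produced by any real model.
labels = ["guitar", "piano", "violin"]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([0.2, 0.9, 0.1])

predicted, sims = zero_shot_classify(audio_emb, text_embs, labels)
print(predicted)  # piano
```

In a setup like this the prediction depends entirely on where the text tower places each label, so a change in label wording that moves a text embedding can change the result even when the audio embedding is unchanged.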
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Ontology
