Toward Universal Text-to-Music Retrieval
SeungHeon Doh, Minz Won, Keunwoo Choi, Juhan Nam

TL;DR
This paper proposes design strategies for a universal text-to-music retrieval system capable of handling various input types, achieving comparable performance across different query formats and generalizing to multiple music classification tasks.
Contribution
It introduces a benchmark and design choices that let a single system process diverse text inputs for music retrieval, overcoming the single-query-type limitation of most previous systems.
Findings
Achieves comparable retrieval performance for tag- and sentence-level inputs.
Generalizes to 9 downstream music classification tasks.
Provides code and demo online for reproducibility.
Abstract
This paper introduces effective design choices for text-to-music retrieval systems. An ideal text-based retrieval system would support various input queries such as pre-defined tags, unseen tags, and sentence-level descriptions. In reality, most previous works have focused on a single query type (tag or sentence), which may not generalize to another input type. Hence, we review recent text-based music retrieval systems using our proposed benchmark in two main aspects: input text representation and training objectives. Our findings enable a universal text-to-music retrieval system that achieves comparable retrieval performance on both tag- and sentence-level inputs. Furthermore, the proposed multimodal representation generalizes to 9 different downstream music classification tasks. We present the code and demo online.
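The abstract does not describe the retrieval mechanics, so the following is only a minimal sketch of one common setup, assumed here rather than taken from the paper: text queries and music tracks are embedded into a shared vector space, and tracks are ranked by cosine similarity to the query. The embeddings, track ids, and the `retrieve` helper are hypothetical and hand-written for illustration; a real system would obtain the vectors from trained text and audio encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, track_vecs, top_k=2):
    """Rank tracks by similarity to the query embedding and return the top_k ids."""
    ranked = sorted(
        track_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [track_id for track_id, _ in ranked[:top_k]]


# Toy embeddings (hypothetical values, not produced by any real encoder).
tracks = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.1, 0.9, 0.1],
    "track_c": [0.7, 0.3, 0.1],
}

# Both query types end up as vectors in the same space, so one
# ranking function serves tag-level and sentence-level inputs alike.
tag_query = [1.0, 0.0, 0.0]       # stands in for an embedded tag
sentence_query = [0.2, 0.8, 0.1]  # stands in for an embedded sentence

print(retrieve(tag_query, tracks))       # ['track_a', 'track_c']
print(retrieve(sentence_query, tracks))  # ['track_b', 'track_c']
```

Under this assumption, supporting a new query type only changes how the query vector is produced; the ranking step stays the same.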
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
