Can Impressions of Music be Extracted from Thumbnail Images?
Takashi Harada, Takehiro Motomitsu, Katsuhiko Hayashi, Yusuke Sakai, Hidetaka Kamigaito

TL;DR
This paper introduces an approach for generating music captions that incorporate non-musical aspects inferred from thumbnail images. The authors use it to create a large caption dataset and train a music retrieval model, and they validate the approach through human evaluations.
Contribution
The paper presents a new method for generating music captions with non-musical information from images and provides a large dataset for music retrieval research.
Findings
Created a dataset with 360,000 captions including non-musical aspects.
Demonstrated effectiveness of the retrieval model in experiments.
Validated approach through human evaluations.
Abstract
In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets, consisting of music data and their corresponding natural language descriptions known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validated the effectiveness of our approach…
Taxonomy
Topics: Music and Audio Processing
