Image2song: Song Retrieval via Bridging Image Content and Lyric Words
Xuelong Li, Di Hu, Xiaoqiang Lu

TL;DR
This paper introduces a novel framework for retrieving semantically relevant songs from images by learning correlations between image content and lyrics using deep neural networks, aiming to enhance emotional expression in social media.
Contribution
The paper proposes a new deep learning-based method for image-to-song retrieval that aligns image regions with lyric content through a tag attention mechanism, and introduces a multimodal dataset for this task.
Findings
The model effectively retrieves relevant songs based on image content.
The approach improves correlation between images and lyrics compared to baseline methods.
The dataset supports multimodal research in image and music retrieval.
Abstract
Images are usually taken to express emotions or purposes, such as love or celebrating Christmas. A better way is to combine the image with a relevant song to amplify the expression, an approach that has recently drawn much attention on social networks. Hence, automatic song selection is desirable. In this paper, we propose to retrieve semantically relevant songs from just an image query, which we call the image2song problem. Motivated by the need to establish correlation at the semantic/content level, we build a semantic-based song retrieval framework that learns the correlation between image content and lyric words. The model uses a convolutional neural network to generate rich tags from image regions, a recurrent neural network to model lyrics, and then establishes correlation via a multi-layer perceptron. To reduce the content gap between image and…
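The pipeline described in the abstract (image tags, a lyric representation, and a multi-layer perceptron that scores their correlation) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the CNN tag vectors and the RNN lyric vectors are already computed and passed in as plain lists of floats, and it assumes a simple dot-product softmax as a stand-in for the tag attention mechanism. The function names `attend_tags`, `mlp_score`, and `rank_songs`, and the weight arguments `w_hidden` and `w_out`, are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_tags(tag_vecs, lyric_vec):
    """Assumed attention: weight each tag vector by its similarity to the lyric.

    tag_vecs stands in for CNN-derived tag features (assumption).
    lyric_vec stands in for an RNN-derived lyric feature (assumption).
    Returns the weighted tag summary and the attention weights.
    """
    weights = softmax([dot(t, lyric_vec) for t in tag_vecs])
    dim = len(tag_vecs[0])
    summary = [sum(w * t[i] for w, t in zip(weights, tag_vecs)) for i in range(dim)]
    return summary, weights


def mlp_score(image_vec, lyric_vec, w_hidden, w_out):
    """One-hidden-layer MLP over the concatenated vectors; output in (0, 1)."""
    x = image_vec + lyric_vec  # list concatenation
    hidden = [math.tanh(dot(row, x)) for row in w_hidden]
    return 1.0 / (1.0 + math.exp(-dot(w_out, hidden)))


def rank_songs(tag_vecs, songs, w_hidden, w_out):
    """Return song titles ordered from highest to lowest correlation score."""
    scored = []
    for title, lyric_vec in songs.items():
        image_vec, _ = attend_tags(tag_vecs, lyric_vec)
        scored.append((mlp_score(image_vec, lyric_vec, w_hidden, w_out), title))
    return [title for _, title in sorted(scored, reverse=True)]
```

In a trained system the weights would be learned from paired image and lyric data; here they are inputs so that the scoring and ranking logic can be read in isolation.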
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Multimodal Machine Learning Applications
