Video-Music Retrieval: A Dual-Path Cross-Modal Network
Xin Gu, Yinghua Shen, Chaohui Lv

TL;DR
This paper introduces a dual-path cross-modal network for video-music retrieval that integrates content and emotional information, significantly improving retrieval accuracy over existing methods.
Contribution
The paper presents a novel dual-path network architecture that combines content and emotional features for more effective video-music retrieval.
Findings
Recall@1 increased by 3.94 compared with existing methods
Recall@25 increased by 16.36 compared with existing methods
Effective merging of content and emotional information
Abstract
We propose a method to recommend background music for videos. Current work rarely considers the emotional information of music, which is essential for video-music retrieval. To address this, we design two paths that process content information and emotional information across modalities. Based on the characteristics of video and music, we design various feature extraction schemes and common representation spaces. More importantly, we propose a way to combine content information with emotional information. Additionally, we improve the classical metric loss to make it better suited to this task. Experiments show that this dual-path video-music retrieval network can effectively merge the two kinds of information. Compared with existing methods, it increases the retrieval evaluation metrics Recall@1 by 3.94 and Recall@25 by 16.36.
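The abstract reports results as Recall@K without defining it. As a rough illustration only, the sketch below computes Recall@K under assumptions that are not taken from the paper: video i is paired with music track i, both sets are already embedded in a common space, and similarity is cosine. The function name `recall_at_k` and the array names are hypothetical, and the paper's own evaluation protocol may differ.

```python
import numpy as np


def recall_at_k(video_emb, music_emb, k):
    """Fraction of videos whose paired music track ranks in the top k.

    Assumption (not from the paper): row i of video_emb is paired with
    row i of music_emb, and both live in the same embedding space.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    sim = v @ m.T  # sim[i, j] = similarity of video i and music j

    # Rank of the true match = number of tracks scored strictly higher.
    true_sim = np.diag(sim)[:, None]
    rank = (sim > true_sim).sum(axis=1)

    # A query counts as a hit when its true match is within the top k.
    return float((rank < k).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    videos = rng.normal(size=(100, 32))
    # Hypothetical paired music embeddings: noisy copies of the videos.
    music = videos + 0.5 * rng.normal(size=(100, 32))
    for k in (1, 10, 25):
        print(f"Recall@{k}: {recall_at_k(videos, music, k):.3f}")
```

Under this definition Recall@K can only stay the same or grow as K increases, because a match found in the top 1 is also in the top 25.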
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Diverse Musicological Studies
