Music Grounding by Short Video

Zijie Xin; Minquan Wang; Jingyu Liu; Ye Ma; Quan Chen; Peng Jiang; Xirong Li

arXiv:2408.16990 · cs.MM · July 22, 2025



TL;DR

This paper introduces a new task, Music Grounding by Short Video (MGSV), which aims to localize a suitable music segment for a given short video. It also presents a large benchmark dataset and a unified deep learning baseline.

Contribution

The paper proposes MGSV as a novel task, builds the MGSV-EC benchmark dataset, and develops MaDe, an end-to-end model that performs both video-to-music matching and music moment localization.

Findings

01. Extensive experiments show that MGSV is a challenging task.

02. MaDe achieves strong baseline performance on MGSV-EC.

03. The dataset contains 53k short videos associated with 35k music moments drawn from 4k unique music tracks.
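The findings do not say how localization quality is scored. As an assumption, moment-localization work commonly compares a predicted segment with the ground-truth segment using temporal intersection-over-union (IoU), and reports the fraction of queries whose IoU clears a threshold. The sketch below shows that idea; `temporal_iou` and `recall_at_iou` are hypothetical helpers, not the paper's evaluation code.

```python
# Hypothetical helpers illustrating temporal IoU for moment localization.
# Not taken from the MGSV paper; the paper's actual metrics may differ.
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: Sequence[Segment], gts: Sequence[Segment], thresh: float) -> float:
    """Fraction of queries whose predicted segment reaches IoU >= thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)


# A 10 s prediction shifted by 5 s against a 10 s ground truth overlaps by 5 s
# out of a 15 s union, so its IoU is 1/3.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Using the explicit union formula (sum of lengths minus overlap) keeps the score correct for disjoint segments as well as overlapping ones.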

Abstract

Adding proper background music helps complete a short video to be shared. Previous work tackles the task by video-to-music retrieval (V2MR), aiming to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k short videos associated with 35k different music moments from 4k unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music…
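To make the task contrast concrete: V2MR only ranks whole tracks, whereas MGSV must also return a start and end time whose length fits the video. The sketch below is a naive two-step illustration under stated assumptions, not MaDe: it assumes each track has already been split into fixed-length clips with precomputed embeddings, scores clips by cosine similarity to a single video embedding, and slides a window as long as the video to pick the best moment. The function name `best_moment` and its inputs are hypothetical.

```python
# Naive sliding-window illustration of music grounding (NOT the MaDe model).
# Assumption: every track is pre-split into clips of `clip_sec` seconds, each
# with an embedding in the same space as the video embedding.
from typing import List, Tuple

import numpy as np


def best_moment(
    video_emb: np.ndarray,          # shape (d,)
    track_clips: List[np.ndarray],  # one (n_clips, d) array per track
    clip_sec: float,
    video_sec: float,
) -> Tuple[int, float, float, float]:
    """Return (track_index, start_sec, end_sec, score) of the best window."""
    win = max(1, int(round(video_sec / clip_sec)))  # window length in clips
    v = video_emb / np.linalg.norm(video_emb)
    best = (-1, 0.0, 0.0, -np.inf)
    for t, clips in enumerate(track_clips):
        if len(clips) < win:  # track shorter than the video: skip it
            continue
        unit = clips / np.linalg.norm(clips, axis=1, keepdims=True)
        sims = unit @ v  # cosine similarity of each clip to the video
        # Mean similarity of every window of `win` consecutive clips,
        # computed with a cumulative sum.
        csum = np.concatenate(([0.0], np.cumsum(sims)))
        window_scores = (csum[win:] - csum[:-win]) / win
        s = int(np.argmax(window_scores))
        score = float(window_scores[s])
        if score > best[3]:
            best = (t, s * clip_sec, (s + win) * clip_sec, score)
    return best


# Toy example: the second track's middle clips align with the video.
video = np.array([1.0, 0.0])
tracks = [
    np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]),
    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
]
print(best_moment(video, tracks, clip_sec=1.0, video_sec=2.0))
```

A separable pipeline like this scores tracks and moments with the same clip similarities; the paper instead describes MaDe as a single end-to-end model for both matching and localization.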


Code & Models

Datasets

xxayt/MGSV-EC · dataset · 31 downloads


Taxonomy

Topics: Music and Audio Processing · Diverse Musicological Studies · Music Technology and Sound Studies