Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation

Junda Wu; Zachary Novack; Amit Namburi; Jiaheng Dai; Hao-Wen Dong; Zhouhang Xie; Carol Chen; Julian McAuley

arXiv:2407.20445·cs.SD·July 31, 2024

TL;DR

FUTGA introduces a novel approach for fine-grained, time-aware music understanding by leveraging generative augmentation with temporal compositions, enabling detailed segment descriptions and improved downstream task performance.

Contribution

The paper presents FUTGA, a model that synthesizes fine-grained, temporally-structured music captions using large language models and generative augmentation, enhancing music understanding and captioning accuracy.
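As a rough illustration of what a temporally structured caption could look like, the sketch below lays short clips end to end and attaches explicit time boundaries to each clip's caption. This is not the paper's pipeline: the `Segment` type, the `compose` and `render` helpers, and the `[M:SS - M:SS]` output format are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    caption: str  # description of this segment


def compose(clips):
    """Lay clips end to end and give each caption explicit time boundaries.

    `clips` is a list of (duration_seconds, caption) pairs (hypothetical input format).
    """
    segments = []
    t = 0.0
    for duration, caption in clips:
        segments.append(Segment(t, t + duration, caption))
        t += duration
    return segments


def format_time(seconds):
    """Render a time in seconds as M:SS."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


def render(segments):
    """Join segments into one time-stamped caption, one segment per line."""
    return "\n".join(
        f"[{format_time(s.start)} - {format_time(s.end)}] {s.caption}"
        for s in segments
    )


if __name__ == "__main__":
    demo = compose([(10.0, "soft piano intro"), (20.0, "drums enter")])
    print(render(demo))
```

A composed caption of this kind could then serve as a training target for a model that has to predict both the boundaries and the per-segment descriptions.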

Findings

01

FUTGA effectively captures temporal changes and musical functions in full-length songs.

02

Generated captions improve performance in music retrieval and generation tasks.

03

The approach outperforms existing methods in fine-grained music description accuracy.

Abstract

Existing music captioning methods are limited to generating concise global descriptions of short music clips, which fail to capture fine-grained musical characteristics and time-aware musical changes. To address these limitations, we propose FUTGA, a model equipped with fine-grained music understanding capabilities through learning from generative augmentation with temporal compositions. We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs. Augmented by the proposed synthetic dataset, FUTGA can identify the music's temporal changes at key transition points and their musical functions, as well as generate detailed descriptions for each music segment. We further introduce a full-length music caption dataset generated by FUTGA, as the…

Code & Models

Repositories

https://huggingface.co/JoshuaW1997/FUTGA
Official

Models

JoshuaW1997/FUTGA
model · ♡ 17

Datasets

JoshuaW1997/FUTGA
dataset · 75 downloads

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies