FUTGA: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation
Junda Wu, Zachary Novack, Amit Namburi, Jiaheng Dai, Hao-Wen Dong, Zhouhang Xie, Carol Chen, Julian McAuley

TL;DR
FUTGA introduces a novel approach for fine-grained, time-aware music understanding by leveraging generative augmentation with temporal compositions, enabling detailed segment descriptions and improved downstream task performance.
Contribution
The paper presents FUTGA, a model trained on fine-grained, temporally structured music captions that are synthesized from existing music caption datasets with large language models. This generative augmentation lets the model describe full-length songs segment by segment, with time boundaries.
Findings
FUTGA effectively captures temporal changes and musical functions in full-length songs.
Generated captions improve performance in music retrieval and generation tasks.
The approach outperforms existing methods in fine-grained music description accuracy.
Abstract
Existing music captioning methods are limited to generating concise global descriptions of short music clips, which fail to capture fine-grained musical characteristics and time-aware musical changes. To address these limitations, we propose FUTGA, a model equipped with fine-grained music understanding capabilities through learning from generative augmentation with temporal compositions. We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs. Augmented by the proposed synthetic dataset, FUTGA is enabled to identify the music's temporal changes at key transition points and their musical functions, as well as generate detailed descriptions for each music segment. We further introduce a full-length music caption dataset generated by FUTGA, as the…
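The temporal composition idea in the abstract can be illustrated with a minimal sketch. It assumes that short clips with global captions are laid end to end and that each caption is given start and end times from the running clip durations. The names `Clip`, `compose`, `render`, and `fmt`, and the bracketed time format, are hypothetical and chosen for illustration only; the paper's actual pipeline, which also uses LLMs to write structural descriptions, is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    """Hypothetical record: one short clip with its global caption."""
    caption: str     # short caption taken from an existing caption dataset
    duration: float  # clip length in seconds


def fmt(seconds: float) -> str:
    """Format a time in seconds as M:SS."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


def compose(clips: List[Clip]) -> List[Tuple[float, float, str]]:
    """Place clips back to back and attach (start, end) boundaries."""
    segments = []
    start = 0.0
    for clip in clips:
        end = start + clip.duration
        segments.append((start, end, clip.caption))
        start = end  # the next segment begins where this one ends
    return segments


def render(segments: List[Tuple[float, float, str]]) -> str:
    """Render segments as one time-stamped caption, one line per segment."""
    return "\n".join(f"[{fmt(s)} - {fmt(e)}] {c}" for s, e, c in segments)


clips = [
    Clip("soft piano intro", 10.0),
    Clip("drums enter with a driving beat", 20.0),
]
segments = compose(clips)
print(render(segments))
```

A composed caption of this kind could serve as a training target for a model that must output both segment boundaries and segment descriptions. How the authors build and phrase their targets is not specified in the text above.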
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
