ALCAP: Alignment-Augmented Music Captioner
Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

TL;DR
This paper introduces ALCAP, a method that uses contrastive learning to align audio and lyrics for music captioning. The alignment yields more coherent, higher-quality captions and state-of-the-art results on two music captioning datasets.
Contribution
The paper presents a new contrastive learning approach for multimodal alignment of audio and lyrics in music captioning, enhancing cross-modal coherence.
Findings
Achieves state-of-the-art performance on two music captioning datasets.
Demonstrates the effectiveness of multimodal alignment between audio and lyrics.
Provides both theoretical and empirical results supporting the advantage of the proposed method.
Abstract
Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the intricate interplay between the two. However, a comprehensive understanding of music necessitates the integration of both these elements. In this study, we delve into this overlooked realm by introducing a method to systematically learn multimodal alignment between audio and lyrics through contrastive learning. This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence, thereby producing high-quality captions. We provide both theoretical and empirical results demonstrating the advantage of the proposed method, which achieves new state-of-the-art on two music captioning…
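The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE-style loss over a batch of paired audio and lyrics embeddings. The function names, the temperature value of 0.07, and the use of cosine similarity are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def contrastive_alignment_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (illustrative, not the paper's exact loss).

    Row i of audio_emb and row i of lyrics_emb are assumed to come from the
    same song; every other pairing in the batch is treated as a negative.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    lyrics = [l2_normalize(v) for v in lyrics_emb]
    n = len(audio)
    # Cosine similarity between every audio/lyrics pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in lyrics]
        for a in audio
    ]
    columns = [list(col) for col in zip(*logits)]
    # Audio-to-lyrics: each audio clip should pick out its own lyrics.
    audio_to_lyrics = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Lyrics-to-audio: each lyrics embedding should pick out its own audio.
    lyrics_to_audio = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_lyrics + lyrics_to_audio)


# Toy data standing in for encoder outputs (random, not real embeddings).
rng = random.Random(0)
audio = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
paired = [[v + 0.01 * rng.gauss(0, 1) for v in row] for row in audio]
unrelated = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = contrastive_alignment_loss(audio, paired)
unrelated_loss = contrastive_alignment_loss(audio, unrelated)
print(aligned_loss < unrelated_loss)
```

In this toy setup the loss is lower when the lyrics embeddings are near copies of their paired audio embeddings than when they are unrelated, which is the behavior an alignment objective of this kind is meant to reward. In a captioning model such a term would presumably be combined with the caption generation loss, but how the two are weighted is not stated in the abstract.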
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
