MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation
Zihao Wang, Haoxuan Liu, Jiaxing Yu, Tao Zhang, Yan Liu, Kejun Zhang

TL;DR
This paper introduces a new task and dataset for aligning colloquial human descriptions with AI-generated music, proposing a novel end-to-end framework that improves human-AI musical collaboration.
Contribution
It presents the Colloquial Description-to-Song Generation task, a new dataset CaiMD, and the MuDiT/MuSiT framework for effective alignment of colloquial language with musical output.
Findings
The CaiMD dataset offers diverse, high-quality colloquial music descriptions.
The MuDiT/MuSiT framework achieves effective cross-modal alignment and cohesive music generation.
The framework enhances human-AI collaboration in creative music processes.
Abstract
Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: ALIGN
