MuCPT: Music-related Natural Language Model Continued Pretraining
Kai Tian, Yirong Mao, Wendong Bi, Hanjie Wang, Que Wenhui

TL;DR
MuCPT introduces a large, curated music-related corpus and a specialized training framework for domain-specific language models, improving music knowledge and task alignment in LLMs.
Contribution
The paper presents a new music-focused pretraining dataset, a domain-first data pipeline, and a reference-model-based quality-control method for building stronger music LLMs.
Findings
Effective filtering and cleaning of music domain data.
Improved model alignment with music tasks.
Introduction of the MusicSimpleQA benchmark.
Abstract
Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open-source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection…
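The abstract is cut off before it specifies the loss-ratio criterion, so the following is only a minimal sketch of what token-level soft scoring against a reference model could look like, not the authors' formulation. It assumes per-token losses are already available from both the model being trained and the reference model. The direction of the ratio, the sigmoid mapping, the `sharpness` parameter, and all function names are hypothetical choices made for illustration.

```python
import math
from typing import List


def loss_ratio_weights(model_losses: List[float],
                       ref_losses: List[float],
                       sharpness: float = 4.0,
                       eps: float = 1e-8) -> List[float]:
    """Map per-token loss ratios to soft weights in (0, 1).

    Assumption (not from the paper): ratio = model_loss / ref_loss, and a
    ratio above 1 marks a token the reference model handles better than the
    model in training, so it receives a weight above 0.5.
    """
    if len(model_losses) != len(ref_losses):
        raise ValueError("loss sequences must have the same length")
    weights = []
    for lm, lr in zip(model_losses, ref_losses):
        ratio = lm / (lr + eps)
        # Sigmoid centered at ratio == 1 gives a soft rather than hard cutoff.
        weights.append(1.0 / (1.0 + math.exp(-sharpness * (ratio - 1.0))))
    return weights


def weighted_token_loss(model_losses: List[float], weights: List[float]) -> float:
    """Weight-normalized training loss over one sequence."""
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(l * w for l, w in zip(model_losses, weights)) / total


def document_score(weights: List[float]) -> float:
    """Mean token weight, usable as a document-level selection score."""
    return sum(weights) / len(weights) if weights else 0.0


# Toy per-token losses (illustrative values only).
model_losses = [2.0, 1.0, 0.5]
ref_losses = [1.0, 1.0, 1.0]
weights = loss_ratio_weights(model_losses, ref_losses)
loss = weighted_token_loss(model_losses, weights)
score = document_score(weights)
```

In this sketch the same weights feed both the training loss and a document-level score, which mirrors the abstract's description of a single criterion serving more than one purpose. The abstract does not say how the paper actually combines them.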
Taxonomy
Topics: Music and Audio Processing · Topic Modeling · Advanced Graph Neural Networks
