MuCPT: Music-related Natural Language Model Continued Pretraining

Kai Tian; Yirong Mao; Wendong Bi; Hanjie Wang; Que Wenhui

arXiv:2511.14245 · cs.CL · November 19, 2025

TL;DR

MuCPT introduces a large, curated music-related corpus and a specialized training framework for domain-specific language models, improving music knowledge and task alignment in LLMs.

Contribution

The paper presents a new music-focused pretraining dataset, a domain-first data pipeline, and a reference-model-based quality-control method for training music LLMs.
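
As a rough illustration of what a domain-first data pipeline can look like, the sketch below chains the stages the abstract names: classifier-based filtering and weighting, cleaning, de-duplication, and privacy masking. It is not the authors' implementation. The keyword scorer stands in for their lightweight classifier, and every name here (`MUSIC_WORDS`, `keyword_score`, `domain_first_pipeline`, the regexes, the 0.5 threshold) is a hypothetical chosen for the example.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical vocabulary; a real system would use a trained classifier.
MUSIC_WORDS = {"album", "guitar", "song", "melody", "lyrics", "band", "concert", "solo"}

# Simple illustrative patterns, not a complete PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def keyword_score(doc: str) -> float:
    """Toy in-domain score in [0, 1] based on music keyword density."""
    words = re.findall(r"[a-z]+", doc.lower())
    if not words:
        return 0.0
    hits = sum(w in MUSIC_WORDS for w in words)
    return min(1.0, 5 * hits / len(words))


def clean(text: str) -> str:
    """Collapse whitespace runs (a stand-in for multi-stage cleaning)."""
    return re.sub(r"\s+", " ", text).strip()


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def domain_first_pipeline(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[Tuple[str, float]]:
    """Yield (text, weight) pairs for in-domain, cleaned, unique documents."""
    seen = set()
    for doc in docs:
        score = score_fn(doc)
        if score < threshold:  # domain filter runs first
            continue
        text = mask_pii(clean(doc))
        # Exact de-duplication on the normalized text.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text, score  # the score doubles as a sampling weight
```

Running the domain filter before the other stages means the more expensive cleaning and hashing work is only spent on text that is likely to be kept. Exact hashing is the simplest de-duplication choice; near-duplicate detection would need a different technique.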

Findings

01. Effective filtering and cleaning of music-domain data.

02. Improved model alignment with music tasks.

03. Introduction of the MusicSimpleQA benchmark.

Abstract

Large language models perform strongly on general tasks but remain constrained in specialized settings such as music, particularly in the music-entertainment domain, where corpus scale, purity, and the match between data and training objectives are critical. We address this by constructing a large, music-related natural language corpus (40B tokens) that combines open source and in-house data, and by implementing a domain-first data pipeline: a lightweight classifier filters and weights in-domain text, followed by multi-stage cleaning, de-duplication, and privacy-preserving masking. We further integrate multi-source music text with associated metadata to form a broader, better-structured foundation of domain knowledge. On the training side, we introduce reference-model (RM)-based token-level soft scoring for quality control: a unified loss-ratio criterion is used both for data selection…
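
The abstract is cut off before it defines the loss-ratio criterion, so the following is only a minimal sketch of how token-level soft scoring against a reference model might work. It assumes the score for a token is the ratio of the training model's loss to the reference model's loss, squashed to a weight in (0, 1), and that a higher ratio should receive more weight. The exact formula, the direction of the ratio, and the function names `token_soft_weights` and `weighted_loss` are all assumptions, not taken from the paper.

```python
from typing import List, Sequence


def token_soft_weights(
    model_losses: Sequence[float],
    ref_losses: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Map per-token loss ratios to soft weights in [0, 1).

    ratio = model_loss / ref_loss; weight = ratio / (1 + ratio).
    A token both models find equally hard gets a weight near 0.5.
    """
    if len(model_losses) != len(ref_losses):
        raise ValueError("loss sequences must have the same length")
    weights = []
    for lm, lr in zip(model_losses, ref_losses):
        ratio = lm / (lr + eps)  # eps guards against a zero reference loss
        weights.append(ratio / (1.0 + ratio))
    return weights


def weighted_loss(model_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weight-normalized mean of the per-token losses."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * l for w, l in zip(weights, model_losses)) / total
```

A soft weight keeps every token in the objective while down-weighting those the criterion considers less useful, in contrast to a hard threshold that drops tokens outright.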

Taxonomy

Topics: Music and Audio Processing · Topic Modeling · Advanced Graph Neural Networks