HeartMuLa: A Family of Open Sourced Music Foundation Models

Dongchao Yang; Yuxin Xie; Yuguo Yin; Zheyu Wang; Xiaoyu Yi; Gongxi Zhu; Xiaolong Weng; Zihan Xiong; Yingzhe Ma; Dading Cong; Jingliang Liu; Zihang Huang; Jinghan Ru; Rongjie Huang; Haoran Wan; Peixu Wang; Kuoxi Yu; Helin Wang; Liming Liang; Xianwei Zhuang; Yuanyuan Wang; Dingdong Wang; Haohan Guo; Junjie Cao; Zeqian Ju; Songxiang Liu; Yuewen Cao; Heming Weng; Yuexian Zou

arXiv:2601.10547·cs.SD·January 27, 2026

TL;DR

HeartMuLa introduces a comprehensive suite of open-source music foundation models that enable advanced music understanding and generation with user-controllable features and high fidelity, scalable to 7B parameters.

Contribution

This work presents a family of open-source models for music understanding and generation, covering audio-text alignment, lyric recognition, music codec tokenization, and song generation, scalable to 7B parameters.

Findings

01. HeartMuLa models achieve high-quality music generation.

02. Scaling to 7B parameters significantly improves performance.

03. Open-source models serve as strong baselines for future research.

Abstract

We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, it provides two specialized modes: (i) fine-grained musical attribute control, which allows users to…
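To make the 12.5 Hz frame rate concrete, the following is a minimal sketch of the arithmetic behind "efficient autoregressive modeling": a lower frame rate means fewer frames for the autoregressive model to step through per song. Only the 12.5 Hz figure comes from the abstract. The 50 Hz comparison rate, the 4-minute song length, and the helper name `num_frames` are illustrative assumptions, and the sketch counts frames only, since the abstract does not state how many tokens each frame carries.

```python
import math


def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


HEARTCODEC_FRAME_RATE_HZ = 12.5  # stated in the abstract
COMPARISON_FRAME_RATE_HZ = 50.0  # hypothetical higher-rate codec, for contrast only

song_seconds = 240  # a 4-minute song, chosen purely for illustration

low = num_frames(song_seconds, HEARTCODEC_FRAME_RATE_HZ)
high = num_frames(song_seconds, COMPARISON_FRAME_RATE_HZ)

print(f"{HEARTCODEC_FRAME_RATE_HZ} Hz: {low} frames")
print(f"{COMPARISON_FRAME_RATE_HZ} Hz: {high} frames")
print(f"sequence length ratio: {high / low:.1f}x")
```

Under these assumptions, the sequence the generator must model grows linearly with the frame rate, which is why a low-frame-rate tokenizer helps with long-range structure over full-length songs.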

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Artificial Intelligence in Games