SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

Ziyang Ma; Guanrou Yang; Wenxi Chen; Zhifu Gao; Yexing Du; Xiquan Li; Zhisheng Zheng; Haina Zhu; Jianheng Zhuo; Zheshu Song; Ruiyang Xu; Tiranrui Wang; Yifan Yang; Yanqiao Zhu; Zhikang Niu; Liumeng Xue; Yinghao Ma; Ruibin Yuan; Shiliang Zhang; Kai Yu; Eng Siong Chng; Xie Chen

arXiv:2601.09385 · cs.SD · January 15, 2026

TL;DR

SLAM-LLM is an open-source framework that enables efficient training and deployment of multimodal large language models focused on speech, audio, and music, facilitating research and development in audio-language AI.

Contribution

It introduces a modular, customizable framework with detailed recipes and high-performance checkpoints for audio-language tasks, advancing the development of audio-focused multimodal models.

Findings

01 Achieved near state-of-the-art performance on several audio-language tasks
02 Provided flexible modules for encoders, projectors, and fine-tuning plugins
03 Released high-quality checkpoints for ASR, AAC, and music captioning
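The modular composition of encoders, projectors, and fine-tuning plugins can be pictured with a small sketch. This is not SLAM-LLM's API: the registries, the `ModelConfig` fields, and the toy components below are all hypothetical names chosen for illustration, and plain Python lists stand in for tensors. It only shows the general idea of selecting swappable components by name from a configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Frames = List[List[float]]  # a sequence of feature vectors

# Hypothetical registries; these names are illustrative, not SLAM-LLM identifiers.
ENCODERS: Dict[str, Callable[[List[float]], Frames]] = {}
PROJECTORS: Dict[str, Callable[[Frames], Frames]] = {}


def register(registry, name):
    """Return a decorator that stores a component under a string key."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap


@register(ENCODERS, "toy_framer")
def toy_framer(samples: List[float]) -> Frames:
    # Stand-in for an audio encoder: cut the waveform into 2-sample frames.
    return [samples[i:i + 2] for i in range(0, len(samples) - 1, 2)]


@register(PROJECTORS, "concat2")
def concat2(frames: Frames) -> Frames:
    # Stand-in for a projector: halve the sequence length by
    # concatenating each pair of neighbouring frames.
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


@dataclass
class ModelConfig:
    encoder: str
    projector: str
    peft: str = "none"  # assumption: only recorded here, not applied


def build_pipeline(cfg: ModelConfig) -> Callable[[List[float]], Frames]:
    """Look up the configured components and chain encoder -> projector."""
    encoder = ENCODERS[cfg.encoder]
    projector = PROJECTORS[cfg.projector]
    return lambda samples: projector(encoder(samples))


cfg = ModelConfig(encoder="toy_framer", projector="concat2")
pipeline = build_pipeline(cfg)
embeddings = pipeline([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
```

Swapping a component then means changing one string in the config rather than editing the pipeline code. In a real system the projector output would be passed to an LLM as input embeddings, which this sketch leaves out.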

Abstract

The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This hinders the development of audio-language models and forces researchers to spend substantial effort on writing code and tuning hyperparameters. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing