MATS: An Audio Language Model under Text-only Supervision

Wen Wang; Ruibing Hou; Hong Chang; Shiguang Shan; Xilin Chen

arXiv:2502.13433 · cs.SD · January 15, 2026

Open Access

TL;DR

MATS introduces a text-only training approach for audio-language models. By leveraging pre-trained audio-language alignment models such as CLAP together with a new modality-bridging mechanism, it gives an LLM audio comprehension capabilities without using any audio data during training.

Contribution

The paper proposes a training strategy that uses only text supervision and pre-trained audio-language alignment models to endow language models with audio understanding capabilities.

Findings

01 Achieves competitive performance with models trained on audio data

02 Demonstrates effective audio comprehension without audio training data

03 Introduces the Santa mechanism for modality bridging

Abstract

Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, training LALMs demands a large corpus of audio-language pairs, which incurs substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio Tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the…
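The projection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear `PrefixProjector`, the dimensions, and the prefix-token layout are hypothetical, and random unit vectors stand in for real CLAP embeddings. It shows only why a projector fitted on text embeddings can also accept audio embeddings at inference, namely that both live in the same shared space. The modality-bridging mechanism mentioned in the abstract is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

D_SHARED = 512   # assumed width of the shared audio-language space
D_LLM = 1024     # assumed LLM hidden size
N_PREFIX = 4     # assumed number of prefix embeddings passed to the LLM


def l2_normalize(x):
    """Scale each row to unit length, as contrastive encoders commonly do."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class PrefixProjector:
    """Hypothetical linear map from the shared space to LLM prefix embeddings."""

    def __init__(self, d_shared, d_llm, n_prefix, rng):
        self.n_prefix = n_prefix
        self.d_llm = d_llm
        self.weight = rng.normal(0.0, 0.02, size=(d_shared, n_prefix * d_llm))
        self.bias = np.zeros(n_prefix * d_llm)

    def __call__(self, shared_emb):
        # (batch, d_shared) -> (batch, n_prefix, d_llm)
        flat = shared_emb @ self.weight + self.bias
        return flat.reshape(shared_emb.shape[0], self.n_prefix, self.d_llm)


projector = PrefixProjector(D_SHARED, D_LLM, N_PREFIX, rng)

# Random stand-ins for encoder outputs; a real setup would use a pre-trained
# alignment model's text encoder (training) and audio encoder (inference).
text_emb = l2_normalize(rng.normal(size=(2, D_SHARED)))
audio_emb = l2_normalize(rng.normal(size=(2, D_SHARED)))

train_prefix = projector(text_emb)    # training path: text embeddings only
infer_prefix = projector(audio_emb)   # inference path: audio embeddings

print(train_prefix.shape, infer_prefix.shape)
```

Both paths produce prefix tensors of identical shape, so the LLM-side interface does not change when the input modality is swapped at inference time.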

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques