MATS: An Audio Language Model under Text-only Supervision
Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen

TL;DR
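MATS trains an audio-language model using text-only supervision: it projects the shared audio-language latent space of a pre-trained alignment model such as CLAP into the LLM latent space, so the LLM gains audio comprehension without any audio data during training, and it adds a new mechanism to bridge the modality gap between audio and language embeddings.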
Contribution
It proposes a new training strategy that uses only text supervision and pre-trained models to endow language models with audio understanding capabilities.
Findings
Achieves performance competitive with models trained on audio data
Demonstrates effective audio comprehension without audio training data
Introduces the Santa mechanism for modality bridging
Abstract
Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, training LALMs demands a large corpus of audio-language pairs, which incurs substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio Tasks using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into the LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the…
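The train-on-text, infer-on-audio idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings are random stand-ins for a frozen CLAP encoder, the "modality gap" is simulated as small additive noise, the projector is a plain linear map fitted by least squares rather than trained through an LLM loss, and all names and dimensions (`text_emb`, `audio_emb`, `D_CLAP`, `D_LLM`) are hypothetical. It does not model the paper's modality-bridging mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CLAP, D_LLM, N = 16, 32, 200  # toy sizes, not taken from the paper


def normalize(x):
    """Scale each row to unit length, as CLAP-style embeddings usually are."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Stand-ins for a frozen alignment model (assumption): paired text and audio
# embeddings live in one shared space and differ by a small offset.
text_emb = normalize(rng.normal(size=(N, D_CLAP)))
audio_emb = normalize(text_emb + 0.05 * rng.normal(size=(N, D_CLAP)))

# Hypothetical supervision targets in the LLM latent space. In a real system
# the signal would come from the LLM's language-modeling loss instead.
targets = text_emb @ rng.normal(size=(D_CLAP, D_LLM))

# Training: the projector sees text embeddings only.
W, *_ = np.linalg.lstsq(text_emb, targets, rcond=None)

# Inference: audio embeddings are swapped in for text embeddings.
from_text = text_emb @ W
from_audio = audio_emb @ W


def mean_cosine(a, b):
    """Average row-wise cosine similarity between two matrices."""
    return float(np.mean(np.sum(normalize(a) * normalize(b), axis=-1)))


similarity = mean_cosine(from_audio, from_text)
print(f"mean cosine(audio-projected, text-projected) = {similarity:.3f}")
```

In this toy setting the projected audio embeddings stay close to the projected text embeddings only because the simulated gap is small; a larger gap degrades the match, which is the problem the abstract says the additional bridging mechanism is meant to address.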
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
