LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli

TL;DR
LiSTEN introduces a novel framework for adapting large language models to audio tasks by learning soft token embeddings, enabling efficient multitask learning with less data and improved interpretability.
Contribution
The paper presents LiSTEN, a dynamic prompt selection method with learnable key-value pairs that adapts LLMs to speech and audio tasks, reducing data dependence and overfitting.
Findings
Achieves competitive performance with fewer trainable parameters.
Simplifies training to a single-stage process.
Enhances interpretability through prompt analysis.
Abstract
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and…
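Illustrative sketch
The abstract describes dynamic prompt selection over learnable key-value pairs but does not give the mechanics. The following is a minimal sketch of one common way such a selection can work, not the authors' implementation. Everything specific here is an assumption: the function name select_prompts, the mean-pooled audio query, cosine similarity as the matching score, top-k selection, and all array sizes. In a real system the keys and values would be trained parameters; here they are random arrays.

```python
import numpy as np

def select_prompts(query, keys, values, top_k):
    """Pick the top_k prompts whose keys best match the query (assumed scheme).

    query:  (d,)       pooled audio feature used as the lookup query
    keys:   (M, d)     one key per prompt in the pool
    values: (M, P, d)  one soft prompt of P token embeddings per key
    """
    # Cosine similarity between the query and every key (assumption).
    q = query / (np.linalg.norm(query) + 1e-8)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    scores = k @ q                                   # shape (M,)
    idx = np.argsort(-scores)[:top_k]                # best-matching keys first
    # Flatten the selected prompts into one sequence of soft tokens.
    prompts = values[idx].reshape(-1, values.shape[-1])  # (top_k * P, d)
    return idx, prompts

# Hypothetical sizes: pool of 8 prompts, 4 tokens each, width 16, 10 audio frames.
rng = np.random.default_rng(0)
M, P, d, T = 8, 4, 16, 10
keys = rng.normal(size=(M, d))
values = rng.normal(size=(M, P, d))
audio_embeddings = rng.normal(size=(T, d))

query = audio_embeddings.mean(axis=0)                # mean pooling (assumption)
idx, prompts = select_prompts(query, keys, values, top_k=2)

# The selected soft tokens are prepended to the audio embeddings before the LLM.
llm_input = np.concatenate([prompts, audio_embeddings], axis=0)
```

Because the chosen indices depend on the input, inspecting which prompts are picked across tasks is one way the diversity of selected prompts could be analyzed.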
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Diverse Musicological Studies
