Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities
Jinhua Liang, Xubo Liu, Wenwu Wang, Mark D. Plumbley, Huy Phan, Emmanouil Benetos

TL;DR
This paper introduces Acoustic Prompt Tuning (APT), an adapter that extends large language models (LLMs) and visual language models (VLMs) to the audio domain using soft prompts and curriculum learning, enabling diverse audio tasks without fine-tuning.
Contribution
The work proposes a novel adapter, APT, that injects audio embeddings into the input of LLMs and VLMs as soft prompts, along with a curriculum learning strategy and interleaved audio-text inputs for versatile audio understanding.
Findings
APT-LLMs achieve competitive results on various audio tasks.
The method extends VLMs to audio without fine-tuning.
The paper introduces a natural language audio reasoning (NLAR) task.
Abstract
The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of language and vision understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capability. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by injecting audio embeddings to the input of LLMs, namely soft prompting. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as the inputs to the language model. To mitigate data scarcity in the audio domain, a curriculum learning strategy is proposed by formulating diverse audio tasks in a sequential manner.…
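The soft-prompting idea in the abstract can be illustrated with a toy sketch: audio is mapped to a few embedding vectors that sit in the same sequence as the text token embeddings fed to the language model. This is only an illustration under stated assumptions, not the authors' implementation. The names `toy_audio_aligner`, `build_inputs`, and `make_embedding_table` are hypothetical, and the mean-pooling aligner is a placeholder that ignores the input text, whereas the aligner described in the abstract is instruction-aware.

```python
import random

EMBED_DIM = 4  # toy size, chosen only for illustration


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a random text-embedding table (stand-in for a frozen LLM's embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]


def toy_audio_aligner(audio_frames, num_prompts, dim):
    """Hypothetical placeholder aligner: mean-pool chunks of audio frames
    into `num_prompts` soft-prompt vectors of size `dim`."""
    chunk = max(1, len(audio_frames) // num_prompts)
    prompts = []
    for i in range(num_prompts):
        # Fall back to the last frame if this chunk would be empty.
        frames = audio_frames[i * chunk:(i + 1) * chunk] or audio_frames[-1:]
        pooled = [sum(col) / len(frames) for col in zip(*frames)]
        # Pad with zeros or truncate so every prompt has exactly `dim` values.
        prompts.append((pooled + [0.0] * dim)[:dim])
    return prompts


def build_inputs(segments, table, num_prompts=2):
    """Interleave text embeddings and audio soft prompts into one input sequence."""
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            sequence.extend(table[token_id] for token_id in payload)
        elif kind == "audio":
            sequence.extend(toy_audio_aligner(payload, num_prompts, len(table[0])))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


table = make_embedding_table(vocab_size=10, dim=EMBED_DIM)
audio = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5, 1.6],
]
# Text tokens, then an audio clip, then more text: an interleaved input.
segments = [("text", [1, 2]), ("audio", audio), ("text", [3])]
inputs = build_inputs(segments, table)
print(len(inputs), len(inputs[0]))
```

In this sketch the language model itself is untouched: only the input sequence changes, which is what makes soft prompting compatible with a frozen backbone.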
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Subtitles and Audiovisual Media
Methods: Adapter
