Acoustic Prompt Tuning: Empowering Large Language Models with Audition Capabilities

Jinhua Liang; Xubo Liu; Wenwu Wang; Mark D. Plumbley; Huy Phan; Emmanouil Benetos

arXiv:2312.00249·eess.AS·February 19, 2025·1 cites

Open Access · 2 Repos

TL;DR

This paper introduces Acoustic Prompt Tuning (APT), a method that extends large language models (LLMs) and visual language models (VLMs) to the audio domain using soft prompts and curriculum learning, enabling diverse audio tasks without fine-tuning.

Contribution

The work proposes a novel adapter, APT, that injects audio embeddings into LLMs and VLMs, along with a curriculum learning strategy and interleaved audio-text inputs for versatile audio understanding.
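To make the idea of interleaved audio-text inputs concrete, the following is a minimal sketch of how such a sequence might be laid out before it reaches the language model. The `Text` and `Audio` classes, the `layout` function, and the fixed number of prompt slots per clip are all hypothetical names and assumptions chosen for illustration; they are not taken from the paper or its repositories.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Text:
    tokens: List[str]  # already-tokenized text (hypothetical tokenizer output)


@dataclass
class Audio:
    clip_id: str
    num_prompt_slots: int  # assumed number of soft-prompt vectors per clip


def layout(segments: List[Union[Text, Audio]]) -> List[str]:
    """Flatten interleaved segments into one ordered list of slot labels.

    Each label marks whether a position holds a text token embedding
    or one of the audio soft-prompt vectors for a given clip.
    """
    slots: List[str] = []
    for seg in segments:
        if isinstance(seg, Text):
            slots.extend(f"txt:{tok}" for tok in seg.tokens)
        else:
            slots.extend(
                f"aud:{seg.clip_id}[{i}]" for i in range(seg.num_prompt_slots)
            )
    return slots


# Two clips interleaved with text, as in a comparison-style question.
segments = [
    Text(["Compare"]),
    Audio("a", 2),
    Text(["and"]),
    Audio("b", 2),
    Text(["."]),
]
slots = layout(segments)
print(slots)
```

The point of the sketch is only the ordering: audio positions sit inline between text positions, so a single prompt can refer to more than one clip.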

Findings

01 APT-LLMs achieve competitive results on various audio tasks.

02 The method extends VLMs to audio without fine-tuning.

03 The paper introduces a natural language audio reasoning (NLAR) task.

Abstract

The auditory system plays a substantial role in shaping the overall human perceptual experience. While prevailing large language models (LLMs) and visual language models (VLMs) have shown their promise in solving a wide variety of language and vision understanding tasks, only a few of them can be generalised to the audio domain without compromising their domain-specific capability. In this work, we introduce Acoustic Prompt Tuning (APT), a new adapter extending LLMs and VLMs to the audio domain by injecting audio embeddings to the input of LLMs, namely soft prompting. Specifically, APT applies an instruction-aware audio aligner to generate soft prompts, conditioned on both input text and sounds, as the inputs to the language model. To mitigate data scarcity in the audio domain, a curriculum learning strategy is proposed by formulating diverse audio tasks in a sequential manner.…
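As a rough illustration of soft prompting, the sketch below projects audio features into a language model's embedding space and prepends them to the text token embeddings. It assumes a single linear projection in place of the instruction-aware aligner described in the abstract, and all names, shapes, and random values (`make_soft_prompt`, `build_llm_input`, `d_audio`, `d_model`) are hypothetical; it is not the authors' implementation.

```python
import numpy as np


def make_soft_prompt(audio_feats, proj_w, proj_b):
    """Map audio features (T_audio, d_audio) to (T_audio, d_model).

    Assumption: a plain linear layer stands in for the paper's aligner.
    """
    return audio_feats @ proj_w + proj_b


def build_llm_input(soft_prompt, text_embeds):
    """Prepend the soft prompt to the text embeddings along the sequence axis."""
    return np.concatenate([soft_prompt, text_embeds], axis=0)


rng = np.random.default_rng(0)
d_audio, d_model = 8, 16                          # illustrative sizes only
audio_feats = rng.normal(size=(5, d_audio))       # 5 audio frames
text_embeds = rng.normal(size=(7, d_model))       # 7 text tokens
proj_w = rng.normal(size=(d_audio, d_model)) * 0.02
proj_b = np.zeros(d_model)

soft_prompt = make_soft_prompt(audio_feats, proj_w, proj_b)
llm_input = build_llm_input(soft_prompt, text_embeds)
print(llm_input.shape)  # (12, 16)
```

In a setup like this, only the projection would be trained while the language model's own weights stay frozen, which is the general appeal of prompt-based adapters.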

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Subtitles and Audiovisual Media

Methods: Adapter