PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki

TL;DR
PALM introduces a novel prompt learning method for Audio-Language Models that optimizes text encoder features, improving efficiency and performance in few-shot audio recognition tasks across multiple datasets.
Contribution
The paper proposes PALM, a prompt learning approach for ALMs that optimizes the feature space of the text encoder branch rather than the input space, giving greater training efficiency with comparable or superior results.
Findings
PALM outperforms baseline methods in few-shot audio recognition.
The approach is more efficient to train than existing prompt learning methods that operate in the input space.
It is effective across 11 audio recognition datasets covering a variety of speech-processing tasks.
Abstract
Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sensitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot…
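The matching step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: random arrays stand in for the outputs of the audio and text encoders, and the helper add_feature_space_context, its context argument, and the mix weight are hypothetical names that only illustrate the idea of adjusting text features after the encoder instead of editing the prompt tokens fed into it.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_predict(audio_feats, text_feats):
    """Assign each audio clip to the class whose text feature is most similar."""
    logits = l2_normalize(audio_feats) @ l2_normalize(text_feats).T
    return logits.argmax(axis=1)

def add_feature_space_context(text_feats, context, mix=0.5):
    """Hypothetical helper: blend text-encoder outputs with per-class vectors.

    In a prompt learning setup, `context` (and possibly `mix`) would be the
    trainable parameters; here they are plain arrays. The blending rule is an
    assumption made for illustration, not a formula taken from the paper.
    """
    return (1.0 - mix) * text_feats + mix * context

# Stand-ins for encoder outputs: 4 classes, 16-dimensional features.
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
text_feats = rng.normal(size=(num_classes, dim))

# Two "audio" features built as slightly noisy copies of the class 2 and class 0 text features.
audio_feats = text_feats[[2, 0]] + 0.01 * rng.normal(size=(2, dim))

preds = zero_shot_predict(audio_feats, text_feats)

# Feature-space adjustment: the text encoder is untouched, only its outputs change.
context = np.zeros((num_classes, dim))
adjusted = add_feature_space_context(text_feats, context, mix=0.0)
```

With mix set to 0 the adjusted features equal the original text features, so the sketch reduces to plain zero-shot matching. In a few-shot setting the context vectors would instead be fitted on the labeled examples.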
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
