MUKA: Multi Kernel Audio Adaptation Of Audio-Language Models

Reda Bensaid; Amine Ouasfi; Yassir Bendou; Ilyass Moummad; Vincent Gripon; François Leduc-Primeau; Adnane Boukhayma

arXiv:2602.14127 · cs.SD · February 17, 2026

TL;DR

MUKA is a multi-kernel adaptation framework for large audio-language models. It combines local and global representations to enable few-shot adaptation without additional training, and it outperforms existing training-free adaptation methods on 11 audio datasets.

Contribution

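The paper proposes MUKA, a multi-kernel adaptation method for audio-language models. It builds a product kernel that combines the fine-grained, context-dependent representations of instruction-tuned models such as Pengi with the global semantic representations of contrastively pretrained models such as CLAP, and it requires no extra training.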

Findings

01. MUKA outperforms existing training-free adaptation methods on 11 audio datasets.

02. MUKA surpasses some training-based adapters in several scenarios.

03. The method maintains theoretical guarantees of kernel methods.
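
This summary does not say which guarantees are meant. One standard property of kernel methods that a product of kernels preserves is positive semi-definiteness (the Schur product theorem). The sketch below checks it numerically; the random features and the choice of linear kernels are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two embeddings of the same 6 audio clips.
x_local = rng.normal(size=(6, 5))
x_global = rng.normal(size=(6, 3))

# Linear (Gram) kernels are positive semi-definite by construction.
k_local = x_local @ x_local.T
k_global = x_global @ x_global.T

# Element-wise (Hadamard) product of the two kernel matrices.
k_product = k_local * k_global

# Smallest eigenvalue of the symmetric product matrix; it should be
# non-negative up to floating-point error.
min_eig = np.linalg.eigvalsh(k_product).min()
print(f"smallest eigenvalue of product kernel: {min_eig:.3e}")
```

Because the product matrix remains positive semi-definite, it can be used wherever a single valid kernel is expected, for example in kernel ridge regression.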

Abstract

Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuning based models like Pengi with the global semantic representations of contrastive pretraining methods like CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding additional training. Extensive experiments across 11 diverse audio datasets demonstrate that MUKA achieves…
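
The product-kernel idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: synthetic vectors stand in for Pengi-style (local) and CLAP-style (global) embeddings, both kernels are Gaussian (RBF), and the few-shot classifier is closed-form kernel ridge regression. The function names `rbf_kernel`, `product_kernel`, and `fit_predict` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def product_kernel(local_a, global_a, local_b, global_b,
                   gamma_local=1.0, gamma_global=1.0):
    """Element-wise product of a kernel on local features and one on global features."""
    return (rbf_kernel(local_a, local_b, gamma_local)
            * rbf_kernel(global_a, global_b, gamma_global))

def fit_predict(local_s, global_s, labels, local_q, global_q, n_classes, ridge=1e-3):
    """Kernel ridge regression on one-hot labels (closed form, no gradient training)."""
    k_ss = product_kernel(local_s, global_s, local_s, global_s)
    y = np.eye(n_classes)[labels]
    alpha = np.linalg.solve(k_ss + ridge * np.eye(len(labels)), y)
    k_qs = product_kernel(local_q, global_q, local_s, global_s)
    return (k_qs @ alpha).argmax(1)

# Synthetic few-shot task: 2 classes, 4 support and 5 query clips per class.
rng = np.random.default_rng(0)

def sample(n_per_class, dim):
    # Class 0 is centred at 0 and class 1 at 3 in every coordinate (assumed toy data).
    return np.concatenate([c + 0.1 * rng.normal(size=(n_per_class, dim))
                           for c in (0.0, 3.0)])

support_labels = np.repeat([0, 1], 4)
query_labels = np.repeat([0, 1], 5)
local_support, global_support = sample(4, 4), sample(4, 3)
local_query, global_query = sample(5, 4), sample(5, 3)

pred = fit_predict(local_support, global_support, support_labels,
                   local_query, global_query, n_classes=2)
print("query accuracy:", (pred == query_labels).mean())
```

In this sketch the only fitting step is a linear solve over the support set. Under these assumptions, that is one way to read "avoiding additional training".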

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis