Adaptive Discovery of Interpretable Audio Attributes with Multimodal LLMs for Low-Resource Classification

Kosuke Yoshimura; Hisashi Kashima

arXiv:2603.06991·cs.SD·March 10, 2026

TL;DR

This paper introduces an adaptive method that uses multimodal large language models (MLLMs) to discover interpretable audio attributes for low-resource audio classification. It outperforms direct MLLM prediction in most evaluated cases and completes training within 11 minutes.

Contribution

The method replaces the humans in the AdaFlock framework with an MLLM, which makes attribute discovery significantly faster than human-driven discovery and improves classification performance in low-resource settings.

Findings

01 Outperforms direct MLLM prediction in most cases

02 Completes training within 11 minutes

03 Provides a practical, adaptive attribute discovery method

Abstract

In predictive modeling for low-resource audio classification, extracting high-accuracy and interpretable attributes is critical. Particularly in high-reliability applications, interpretable audio attributes are indispensable. While human-driven attribute discovery is effective, its low throughput becomes a bottleneck. We propose a method for adaptively discovering interpretable audio attributes using Multimodal Large Language Models (MLLMs). By replacing humans in the AdaFlock framework with MLLMs, our method achieves significantly faster attribute discovery. Our method dynamically identifies salient acoustic characteristics via prompting and constructs an attribute-based ensemble classifier. Experimental results across various audio tasks demonstrate that our method outperforms direct MLLM prediction in the majority of evaluated cases. The entire training completes within 11 minutes,…
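
The abstract describes the method only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes an AdaBoost-style loop: in each round the currently hardest examples are shown to an attribute proposer, the proposed yes/no attribute is evaluated on every clip, and it joins a weighted ensemble. The MLLM prompt is replaced here by a hypothetical stub (`propose_attribute_stub`) that picks from two hand-written candidate attributes, and the clips are toy dictionaries with made-up `pitch` and `noise` values instead of real audio.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Clip = Dict[str, float]            # toy stand-in for an audio clip (assumption)
Attribute = Callable[[Clip], int]  # answers a yes/no question as +1 / -1

# Hypothetical candidate attributes. In the paper's setting an MLLM would
# propose and answer such questions; these thresholds are invented.
CANDIDATES: Dict[str, Attribute] = {
    "sounds high-pitched": lambda c: 1 if c["pitch"] > 0.5 else -1,
    "sounds noisy": lambda c: 1 if c["noise"] > 0.5 else -1,
}


def propose_attribute_stub(
    clips: Sequence[Clip], labels: Sequence[int]
) -> Tuple[str, Attribute]:
    """Stand-in for the MLLM prompt: return the candidate attribute that
    agrees most often with the labels of the examples it is shown."""
    def agreement(item: Tuple[str, Attribute]) -> int:
        _, attr = item
        return sum(attr(c) == y for c, y in zip(clips, labels))
    return max(CANDIDATES.items(), key=agreement)


def train_adaptive_ensemble(
    clips: Sequence[Clip], labels: Sequence[int], rounds: int, shown: int = 4
) -> List[Tuple[float, str, Attribute]]:
    """AdaBoost-style loop (an assumption about the framework, not confirmed
    by the abstract): reweight examples so later rounds focus on mistakes."""
    n = len(clips)
    weights = [1.0 / n] * n
    ensemble: List[Tuple[float, str, Attribute]] = []
    for _ in range(rounds):
        # Show the highest-weight (hardest) examples to the proposer.
        hard = sorted(range(n), key=lambda i: -weights[i])[:shown]
        name, attr = propose_attribute_stub(
            [clips[i] for i in hard], [labels[i] for i in hard]
        )
        preds = [attr(c) for c in clips]
        # Weighted error, clamped so the logarithm below stays finite.
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        err = min(max(err, 1e-9), 1 - 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, name, attr))
        # Increase the weight of misclassified clips, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble: Sequence[Tuple[float, str, Attribute]], clip: Clip) -> int:
    """Weighted vote over the discovered attributes."""
    score = sum(alpha * attr(clip) for alpha, _, attr in ensemble)
    return 1 if score >= 0 else -1


# Toy data: the label is +1 exactly when the made-up pitch value is high.
CLIPS: List[Clip] = [
    {"pitch": 0.9, "noise": 0.2}, {"pitch": 0.1, "noise": 0.8},
    {"pitch": 0.8, "noise": 0.7}, {"pitch": 0.2, "noise": 0.1},
    {"pitch": 0.7, "noise": 0.9}, {"pitch": 0.3, "noise": 0.6},
]
LABELS: List[int] = [1, -1, 1, -1, 1, -1]

ensemble = train_adaptive_ensemble(CLIPS, LABELS, rounds=2)
for alpha, name, _ in ensemble:
    print(f"{name}: weight {alpha:.2f}")
```

Because each ensemble member is a named yes/no attribute with a scalar weight, the final classifier can be read directly as a list of weighted questions, which is the sense in which such an ensemble is interpretable. How the actual method prompts the MLLM, selects examples, and fits the weak learners is not specified in the abstract.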

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Explainable Artificial Intelligence (XAI)