Adaptive Discovery of Interpretable Audio Attributes with Multimodal LLMs for Low-Resource Classification
Kosuke Yoshimura, Hisashi Kashima

TL;DR
This paper introduces an adaptive method that uses multimodal large language models (MLLMs) to discover interpretable audio attributes for low-resource audio classification. It outperforms direct MLLM prediction in most evaluated cases and completes training within 11 minutes.
Contribution
The method replaces the human attribute discoverers in the AdaFlock framework with an MLLM, which makes attribute discovery significantly faster, and builds an attribute-based ensemble classifier from the discovered attributes for low-resource settings.
Findings
Outperforms direct MLLM prediction in most cases
Completes training within 11 minutes
Provides a practical, adaptive attribute discovery method
Abstract
In predictive modeling for low-resource audio classification, extracting high-accuracy and interpretable attributes is critical. Particularly in high-reliability applications, interpretable audio attributes are indispensable. While human-driven attribute discovery is effective, its low throughput becomes a bottleneck. We propose a method for adaptively discovering interpretable audio attributes using Multimodal Large Language Models (MLLMs). By replacing humans in the AdaFlock framework with MLLMs, our method achieves significantly faster attribute discovery. Our method dynamically identifies salient acoustic characteristics via prompting and constructs an attribute-based ensemble classifier. Experimental results across various audio tasks demonstrate that our method outperforms direct MLLM prediction in the majority of evaluated cases. The entire training completes within 11 minutes,…
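The loop described in the abstract (prompt for salient attributes, then combine them into an attribute-based ensemble) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an AdaBoost-style reweighting scheme, binary labels in {-1, +1}, and yes/no attributes. The function `propose_attributes` is a hypothetical stand-in for the MLLM prompting step, and the toy samples, attribute names, and parameters (`rounds`, `n_hard`) are invented for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = dict  # hypothetical: pre-computed yes/no answers keyed by attribute name
Attribute = Tuple[str, Callable[[Sample], int]]  # (description, sample -> +1 or -1)


def propose_attributes(hard_samples: Sequence[Sample]) -> List[Attribute]:
    """Hypothetical stand-in for the MLLM prompting step.

    A real system would show the hard audio clips to an MLLM and ask for
    discriminative attributes. Here we simply expose every key in the toy data.
    """
    names = sorted({key for s in hard_samples for key in s})
    return [(n, (lambda s, n=n: 1 if s.get(n) else -1)) for n in names]


def fit_attribute_ensemble(samples, labels, rounds=3, n_hard=2):
    """Assumed AdaBoost-style loop: one attribute is added per round."""
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []  # entries: (alpha, polarity, description, attribute function)
    for _ in range(rounds):
        # Pass the currently hardest (highest-weight) samples to the proposer.
        hard_idx = sorted(range(n), key=lambda i: -weights[i])[:n_hard]
        candidates = propose_attributes([samples[i] for i in hard_idx])
        best = None
        for desc, fn in candidates:
            # Weighted error of using this attribute directly as a classifier.
            err = sum(w for w, s, y in zip(weights, samples, labels) if fn(s) != y)
            polarity = 1
            if err > 0.5:  # an inverted attribute is also informative
                err, polarity = 1.0 - err, -1
            if best is None or err < best[0]:
                best = (err, polarity, desc, fn)
        if best is None:
            break
        err, polarity, desc, fn = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, polarity, desc, fn))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        weights = [
            w * math.exp(-alpha * y * polarity * fn(s))
            for w, s, y in zip(weights, samples, labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, sample):
    """Sign of the alpha-weighted vote over the selected attributes."""
    score = sum(a * p * fn(sample) for a, p, _, fn in ensemble)
    return 1 if score >= 0 else -1


# Toy data (invented): the label is +1 exactly when the clip has a steady beat.
samples = [
    {"steady_beat": True, "human_voice": False},
    {"steady_beat": True, "human_voice": True},
    {"steady_beat": False, "human_voice": True},
    {"steady_beat": False, "human_voice": False},
]
labels = [1, 1, -1, -1]
ensemble = fit_attribute_ensemble(samples, labels)
predictions = [predict(ensemble, s) for s in samples]
```

Because every ensemble member is a named yes/no attribute with a weight, the final prediction can be read as a weighted vote over human-readable questions, which is the sense in which such a classifier is interpretable.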
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Explainable Artificial Intelligence (XAI)
