CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models

Videet Mehta; Liming Wang; Hilde Kuehne; Rogerio Feris; James R. Glass; M. Jehanzeb Mirza

arXiv:2602.07077 · cs.SD · March 24, 2026

TL;DR

This paper introduces Class-Conditional Sparse Attention Vectors (CALM), a few-shot classification method that learns class-specific importance weights for attention heads in large audio-language models and improves accuracy over uniform-voting baselines.

Contribution

The paper proposes a class-conditional weighting scheme over attention heads, so that individual heads can specialize in particular classes and the ensemble prediction in large audio-language models improves.

Findings

01. Outperforms state-of-the-art uniform voting methods by up to 14.52% in accuracy.

02. Achieves up to 8.35% absolute gains in spoofing detection.

03. Demonstrates consistent improvements across multiple audio and audiovisual classification benchmarks.

Abstract

Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to…
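The contrast the abstract draws between uniform voting and class-dependent head weights can be illustrated with a small sketch. It assumes each selected attention head has already produced a class prediction for an input, and it estimates the weight of head h for class c as that head's precision on a few-shot support set. That estimator is an illustrative stand-in, not necessarily how the paper learns its weights, and the names `uniform_vote`, `fit_class_weights`, and `weighted_vote` are hypothetical.

```python
def uniform_vote(head_preds, num_classes):
    """Baseline: every head's vote counts equally, whatever class it votes for."""
    scores = [0.0] * num_classes
    for pred in head_preds:
        scores[pred] += 1.0
    # Ties resolve to the lowest class index.
    return max(range(num_classes), key=lambda c: scores[c])


def fit_class_weights(support_preds, support_labels, num_classes):
    """Estimate weights[h][c] from a few-shot support set.

    Assumption for illustration only: the weight is head h's precision
    when it predicts class c (correct predictions / total predictions).
    """
    num_heads = len(support_preds[0])
    hits = [[0.0] * num_classes for _ in range(num_heads)]
    counts = [[0.0] * num_classes for _ in range(num_heads)]
    for preds, label in zip(support_preds, support_labels):
        for h, pred in enumerate(preds):
            counts[h][pred] += 1.0
            if pred == label:
                hits[h][pred] += 1.0
    return [
        [hits[h][c] / counts[h][c] if counts[h][c] else 0.0
         for c in range(num_classes)]
        for h in range(num_heads)
    ]


def weighted_vote(head_preds, weights, num_classes):
    """Class-conditional voting: a vote for class c from head h adds weights[h][c]."""
    scores = [0.0] * num_classes
    for h, pred in enumerate(head_preds):
        scores[pred] += weights[h][pred]
    return max(range(num_classes), key=lambda c: scores[c])


# Toy support set: 4 examples, 3 heads, 2 classes (made-up numbers).
support_preds = [[0, 1, 1], [0, 1, 1], [1, 1, 0], [1, 0, 1]]
support_labels = [0, 0, 1, 1]
weights = fit_class_weights(support_preds, support_labels, num_classes=2)

query_preds = [0, 1, 1]  # head 0 votes class 0; heads 1 and 2 vote class 1
uniform_choice = uniform_vote(query_preds, num_classes=2)
weighted_choice = weighted_vote(query_preds, weights, num_classes=2)
```

In this toy data, heads 1 and 2 are often wrong when they predict class 1, so their two votes together carry less weight than the single vote from head 0. Uniform voting picks class 1, while the class-conditional scheme picks class 0.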

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis