Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition

Qingran Yang; Botao Zhao; Zuheng Kang; Xue Li; Yayun He; Chuhang Liu; Xulong Zhang; Xiaoyang Qu; Junqing Peng; Jianzong Wang

arXiv:2602.01547·cs.SD·February 3, 2026

TL;DR

This paper introduces PL-Distill, a knowledge distillation framework for large audio-language models applied to speech emotion recognition. It uses attention-weighted centered kernel alignment to align cross-modal features and compress the model.

Contribution

The paper proposes Attention-weighted Centered Kernel Alignment, a feature-alignment method for knowledge distillation of large audio-language models that highlights important time steps and addresses dimension mismatches.

Findings

01. PL-Distill compresses models from 8.4B to 1.1B parameters.

02. It outperforms state-of-the-art models and baselines on multiple datasets.

03. The method effectively aligns cross-modal features despite dimension mismatches.
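
To illustrate why a CKA-style objective can compare features of different widths, here is a minimal NumPy sketch of linear centered kernel alignment with per-time-step weights. This is not the paper's formulation: the weighting scheme, the function names `weighted_linear_cka` and `pdist_style_loss`, and the use of softmax-normalized scores as weights are all assumptions made for illustration. With uniform weights the function reduces to standard linear CKA.

```python
import numpy as np

def weighted_linear_cka(x, y, w):
    """Linear CKA between x (T, d1) and y (T, d2), weighted per time step.

    Hypothetical sketch: w (T,) holds non-negative importance weights,
    e.g. derived from attention scores. d1 and d2 may differ.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                                  # normalize to sum to 1
    # Subtract the weighted mean over time from each feature column.
    x = x - (w[:, None] * x).sum(axis=0, keepdims=True)
    y = y - (w[:, None] * y).sum(axis=0, keepdims=True)
    # Scale rows by sqrt(w) so that x.T @ x is the weighted covariance.
    s = np.sqrt(w)[:, None]
    x, y = s * x, s * y
    cross = np.linalg.norm(x.T @ y, "fro") ** 2      # (d1, d2) cross-covariance
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

def pdist_style_loss(student_feats, teacher_feats, w):
    """Assumed loss form: 1 - CKA, so perfect alignment gives 0."""
    return 1.0 - weighted_linear_cka(student_feats, teacher_feats, w)

# Toy usage: 50 time steps, teacher width 16, student width 4.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(50, 16))
student = rng.normal(size=(50, 4))
scores = rng.normal(size=50)                         # stand-in attention scores
weights = np.exp(scores - scores.max())              # softmax numerator
print(pdist_style_loss(student, teacher, weights))
```

Because the score is built from the two feature covariances rather than from element-wise differences, it needs no extra projection layer to reconcile a teacher width of 16 with a student width of 4. It is also unchanged by rotating or uniformly rescaling either feature set.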

Abstract

The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, distillation of the cross-modal projection module (Projector) remains underexplored, and existing methods often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel approach that highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and…
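
The logits-level term described in the abstract can be sketched as a Kullback-Leibler divergence between the teacher's and student's output distributions. The sketch below is illustrative only: the direction KL(teacher ‖ student), averaging over positions, the absence of a temperature, and the helper names `log_softmax` and `kl_logits` are assumptions, and it presumes both models share the same output vocabulary.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_logits(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions.

    Both inputs have shape (positions, vocab). The direction and the
    averaging are assumptions, not taken from the paper.
    """
    log_p = log_softmax(teacher_logits)   # teacher log-probabilities
    log_q = log_softmax(student_logits)   # student log-probabilities
    per_position = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_position.mean())

# Toy usage: 6 output positions over a hypothetical 10-token vocabulary.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(6, 10))
student_logits = rng.normal(size=(6, 10))
print(kl_logits(teacher_logits, student_logits))
```

The divergence is zero only when the student reproduces the teacher's distribution at every position, which is what makes it usable as a distillation loss.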

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech and Audio Processing