AdaLTM: Adaptive Layer-wise Task Vector Merging for Categorical Speech Emotion Recognition with ASR Knowledge Integration

Chia-Yu Lee; Huang-Cheng Chou; Tzu-Quan Lin; Yuanchao Li; Ya-Tse Wu; Shrikanth Narayanan; Chi-Chun Lee

arXiv:2603.25041 · eess.AS · March 27, 2026

TL;DR

This paper introduces AdaLTM, which brings ASR knowledge into SER by merging ASR and SER task vectors into a frozen WavLM-Large base model with learnable layer-wise coefficients. Because the two tasks are never optimized jointly, the method balances linguistic and paralinguistic information across layers without gradient interference.

Contribution

The paper proposes a layer-wise task vector merging framework that adds task vectors from fine-tuned ASR and SER models to a frozen base model, weighted by learnable per-layer coefficients, to enhance speech emotion recognition with ASR knowledge.

Findings

01. Effective mitigation of conflicts between ASR and SER.

02. Improved emotion recognition accuracy on MSP-Podcast.

03. Layer-wise integration balances linguistic and emotional features.

Abstract

Integrating Automatic Speech Recognition (ASR) into Speech Emotion Recognition (SER) enhances modeling by providing linguistic context. However, conventional feature fusion faces performance bottlenecks, and multi-task learning often suffers from optimization conflicts. While task vectors and model merging have addressed such conflicts in NLP and CV, their potential in speech tasks remains largely unexplored. In this work, we propose an Adaptive Layer-wise Task Vector Merging (AdaLTM) framework based on WavLM-Large. Instead of joint optimization, we extract task vectors from in-domain ASR and SER models fine-tuned on emotion datasets. These vectors are integrated into a frozen base model using layer-wise learnable coefficients. This strategy enables depth-aware balancing of linguistic and paralinguistic knowledge across transformer layers without gradient interference. Experiments on…
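The merging step described in the abstract can be illustrated with a minimal sketch. It assumes the standard task-arithmetic form, in which a task vector is the difference between fine-tuned and base weights and each layer receives its own scalar coefficient per task; the abstract does not give the exact parameterization, so this form is an assumption. The function names (`task_vector`, `merge_layerwise`), the toy two-layer "model", and the coefficient values are all hypothetical. In the paper the coefficients are learned while the base model stays frozen; here they are fixed by hand to keep the example self-contained.

```python
from typing import Dict, List

# Toy stand-in for a model: layer name -> flattened weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Task vector per layer: fine-tuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_layerwise(
    base: Params,
    tau_asr: Params,
    tau_ser: Params,
    lam_asr: Dict[str, float],
    lam_ser: Dict[str, float],
) -> Params:
    """Add both task vectors to the base weights, scaled per layer.

    Assumed form: theta_l = base_l + lam_asr_l * tau_asr_l + lam_ser_l * tau_ser_l.
    The base weights are only read, never modified.
    """
    merged: Params = {}
    for name, weights in base.items():
        merged[name] = [
            b + lam_asr[name] * a + lam_ser[name] * s
            for b, a, s in zip(weights, tau_asr[name], tau_ser[name])
        ]
    return merged


# Hypothetical toy weights for a two-layer model.
base = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
asr_ft = {"layer0": [2.0, 2.0], "layer1": [0.0, 6.0]}
ser_ft = {"layer0": [1.0, 4.0], "layer1": [2.0, 4.0]}

tau_asr = task_vector(asr_ft, base)
tau_ser = task_vector(ser_ft, base)

# Hand-picked coefficients (learned in the actual method): this toy setting
# weights ASR more in layer0 and SER more in layer1.
lam_asr = {"layer0": 1.0, "layer1": 0.5}
lam_ser = {"layer0": 0.5, "layer1": 1.0}

merged = merge_layerwise(base, tau_asr, tau_ser, lam_asr, lam_ser)
print(merged)
```

Because only the per-layer coefficients would be trained in such a setup, the ASR and SER objectives never update shared weights at the same time, which is how this style of merging avoids the gradient interference mentioned above.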

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Music and Audio Processing