Complementarity-Supervised Spectral-Band Routing for Multimodal Emotion Recognition

Zhexian Huang; Bo Zhao; Hui Ma; Zhishu Liu; Jie Zhang; Ruixin Zhang; Shouhong Ding; Zitong Yu

arXiv:2603.13340 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces Atsuko, a novel multimodal emotion recognition model that decomposes features into frequency bands and uses a complementarity-guided routing mechanism to improve fusion of heterogeneous modalities.

Contribution

It proposes a multi-scale band decomposition and a complementarity-supervised routing framework for more effective multimodal emotion recognition.

Findings

01 Achieves superior performance on multiple emotion recognition benchmarks.

02 Effectively models fine-grained cross-modal interactions.

03 Mitigates dominance of certain modalities through complementarity supervision.

Abstract

Multimodal emotion recognition fuses cues such as text, video, and audio to understand individual emotional states. Prior methods face two main limitations: mechanically relying on independent unimodal performance, thereby missing genuine complementary contributions, and coarse-grained fusion conflicting with the fine-grained representations required by emotion tasks. As inconsistent information density across heterogeneous modalities hinders inter-modal feature mining, we propose the Complementarity-Supervised Multi-Band Expert Network, named Atsuko, to model fine-grained complementary features via multi-scale band decomposition and expert collaboration. Specifically, we orthogonally decompose each modality's features into high, mid, and low-frequency components. Building upon this band-level routing, we design a modality-level router with a dual-path mechanism for fine-grained…
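
The abstract does not say how the band decomposition is computed. As a rough illustration only, one way to obtain three mutually orthogonal frequency components is to apply disjoint masks in the Fourier domain along the temporal axis of a feature sequence. The function name `band_decompose`, the cutoff values, and the choice of an FFT are all assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def band_decompose(x, cutoffs=(0.1, 0.3)):
    """Split a (time, dim) feature sequence into low/mid/high bands.

    Hypothetical sketch: uses disjoint FFT masks along the time axis.
    The cutoffs are normalized frequencies in [0, 0.5] and are arbitrary.
    """
    t = x.shape[0]
    spec = np.fft.rfft(x, axis=0)        # one-sided spectrum per feature dim
    freqs = np.fft.rfftfreq(t)           # normalized frequencies, 0 .. 0.5
    lo_cut, hi_cut = cutoffs
    masks = {
        "low": freqs < lo_cut,
        "mid": (freqs >= lo_cut) & (freqs < hi_cut),
        "high": freqs >= hi_cut,
    }
    # Each band keeps only its own frequency bins, then returns to the time domain.
    return {
        name: np.fft.irfft(spec * mask[:, None], n=t, axis=0)
        for name, mask in masks.items()
    }


rng = np.random.default_rng(0)
features = rng.standard_normal((64, 8))  # stand-in for one modality's features
bands = band_decompose(features)
reconstruction = bands["low"] + bands["mid"] + bands["high"]
```

Because the masks partition the frequency bins, the three bands sum back to the input and have zero inner product with one another, which is one concrete reading of "orthogonal" here. A learned or wavelet-based decomposition would be equally consistent with the abstract's wording.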

Taxonomy

Topics: Emotion and Mood Recognition · Face and Expression Recognition · Music and Audio Processing