Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild

Jun Yu; Yunxiang Zhang; Naixiang Zheng; Lingsi Zhu; Guoyuan Wang

arXiv:2603.11306 · cs.CV · March 13, 2026

Open Access

TL;DR

This paper introduces a robust multimodal framework for facial Action Unit (AU) detection in in-the-wild environments. It combines hierarchical granularity alignment, state space models, and foundation models to improve accuracy and capture long-range temporal dynamics.

Contribution

It proposes a novel multimodal framework with hierarchical alignment and state space modeling, leveraging foundation models for improved AU detection in challenging in-the-wild scenarios.

Findings

01. Achieved state-of-the-art performance on the Aff-Wild2 dataset.

02. Secured top rankings in the Affective Behavior Analysis in-the-wild Competition.

03. Effectively captures ultra-long-range temporal dynamics with linear complexity.

Abstract

Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local…
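
The linear-complexity claim for state space models can be illustrated with a generic diagonal linear recurrence. This is a minimal sketch, not the authors' architecture: the function name diagonal_ssm_scan, the parameter shapes, and the fixed (non-input-dependent) weights are all assumptions made for illustration. It shows only that one pass over T frames costs O(T), because each step updates a fixed-size hidden state instead of attending to all earlier frames.

```python
import numpy as np


def diagonal_ssm_scan(u, a, b, c):
    """Run a diagonal linear state space recurrence over a sequence.

    Hypothetical illustration, not the paper's model.

    u: (T, D) input features, one row per frame.
    a, b, c: (D, N) per-channel transition, input, and output weights.
    Returns y: (T, D), where
        h_t = a * h_{t-1} + b * u_t
        y_t = sum over the N state dimensions of c * h_t
    """
    T, D = u.shape
    h = np.zeros_like(a, dtype=float)  # hidden state, shape (D, N)
    y = np.empty((T, D), dtype=float)
    for t in range(T):  # a single pass: cost grows linearly with T
        h = a * h + b * u[t][:, None]
        y[t] = (c * h).sum(axis=1)
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, N = 1000, 8, 4  # assumed sizes, chosen only for the demo
    u = rng.standard_normal((T, D))
    a = np.full((D, N), 0.9)  # |a| < 1 keeps the state bounded
    b = np.ones((D, N))
    c = np.ones((D, N)) / N
    print(diagonal_ssm_scan(u, a, b, c).shape)
```

The hidden state has a fixed size of D x N regardless of sequence length, so memory per step stays constant while the decay weights in a control how far back earlier frames influence the output.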

Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Speech and Audio Processing