Hierarchical Granularity Alignment and State Space Modeling for Robust Multimodal AU Detection in the Wild
Jun Yu, Yunxiang Zhang, Naixiang Zheng, Lingsi Zhu, Guoyuan Wang

TL;DR
This paper introduces a robust multimodal framework for facial Action Unit (AU) detection in in-the-wild environments, using hierarchical alignment, state space models, and foundation models to improve accuracy and capture long-range temporal dynamics.
Contribution
It proposes a novel multimodal framework with hierarchical alignment and state space modeling, leveraging foundation models for improved AU detection in challenging in-the-wild scenarios.
Findings
Achieved state-of-the-art performance on the Aff-Wild2 dataset.
Secured top rankings in the Affective Behavior Analysis in-the-wild Competition.
Effectively captures ultra-long-range temporal dynamics with linear complexity.
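The linear-complexity claim can be illustrated with a generic discrete state space recurrence. The sketch below is not the authors' model: it is a minimal diagonal linear SSM written with NumPy, and the function name `diagonal_ssm_scan` and the parameters `a`, `b`, `c` are hypothetical names chosen for illustration. It only shows why a recurrent state update costs one fixed-size step per frame.

```python
import numpy as np

def diagonal_ssm_scan(x, a, b, c):
    """Run a generic diagonal linear state space model over a sequence.

    Illustrative only; this is an assumed textbook form, not the paper's layer.
    x: (T, D) input features, one row per frame
    a: (N,)   diagonal state transition (|a| < 1 keeps the state bounded)
    b: (N, D) input projection
    c: (K, N) output projection
    Returns y: (T, K), one output row per frame.
    """
    T = x.shape[0]
    h = np.zeros(a.shape[0])           # hidden state carried across frames
    y = np.empty((T, c.shape[0]))
    for t in range(T):                 # one fixed-cost update per frame
        h = a * h + b @ x[t]           # h_t = a * h_{t-1} + B x_t
        y[t] = c @ h                   # y_t = C h_t
    return y

# Impulse input: the response decays geometrically with the transition value.
impulse = np.array([[1.0], [0.0], [0.0]])
response = diagonal_ssm_scan(
    impulse, a=np.array([0.5]), b=np.array([[1.0]]), c=np.array([[1.0]])
)
```

Because each frame triggers a single fixed-size state update, the total cost grows linearly with the sequence length T, whereas full self-attention compares every pair of frames and grows quadratically in T.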
Abstract
Facial Action Unit (AU) detection in in-the-wild environments remains a formidable challenge due to severe spatial-temporal heterogeneity, unconstrained poses, and complex audio-visual dependencies. While recent multimodal approaches have made progress, they often rely on capacity-limited encoders and shallow fusion mechanisms that fail to capture fine-grained semantic shifts and ultra-long temporal contexts. To bridge this gap, we propose a novel multimodal framework driven by Hierarchical Granularity Alignment and State Space Models. Specifically, we leverage powerful foundation models, namely DINOv2 and WavLM, to extract robust and high-fidelity visual and audio representations, effectively replacing traditional feature extractors. To handle extreme facial variations, our Hierarchical Granularity Alignment module dynamically aligns global facial semantics with fine-grained local…
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Speech and Audio Processing
