Hierarchical Vision-Language Interaction for Facial Action Unit Detection
Yong Li, Yi Ren, Yizhe Zhang, Wenhua Zhang, Tianyi Zhang, Muyun Jiang, Guo-Sen Xie, Cuntai Guan

TL;DR
This paper introduces HiVA, a hierarchical vision-language model that leverages textual AU descriptions and advanced cross-modal attention to improve facial Action Unit detection, especially under limited data conditions.
Contribution
The paper proposes a novel hierarchical cross-modal framework with AU-aware graph modules and dual attention mechanisms, enhancing AU detection by integrating rich language priors and visual features.
Findings
HiVA outperforms existing state-of-the-art AU detection methods.
The model produces semantically meaningful activation patterns.
It demonstrates robustness and interpretability in cross-modal AU analysis.
Abstract
Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is learning discriminative and generalizable AU representations when annotated data is limited. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated…
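To make the idea of cross-modal attention over AU descriptions concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which each AU's visual feature attends over the text embeddings of the AU descriptions. This is not HiVA's actual module: the function names (`softmax`, `cross_attend`), the absence of learned projections, the feature dimension, and the toy inputs are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual_feats, text_feats):
    """Scaled dot-product cross-attention without learned projections.

    visual_feats: one vector per AU, used as queries (hypothetical input).
    text_feats:   one vector per AU description, used as keys and values.
    Returns (fused, weights): a text-informed vector per AU and the
    attention distribution each AU places over the descriptions.
    """
    dim = len(text_feats[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in visual_feats:
        # Similarity between this AU's visual feature and every description.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in text_feats
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of description embeddings.
        fused.append([
            sum(w * value[i] for w, value in zip(row, text_feats))
            for i in range(dim)
        ])
    return fused, weights


# Toy, made-up inputs: 2 AUs with 3-dimensional features.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
fused, weights = cross_attend(visual, text)
```

In this toy setup each visual feature places most of its attention on the description it is aligned with. A real model would typically add learned query, key, and value projections and multiple heads, which this sketch omits.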
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Face Recognition and Perception
