Hierarchical Vision-Language Interaction for Facial Action Unit Detection

Yong Li; Yi Ren; Yizhe Zhang; Wenhua Zhang; Tianyi Zhang; Muyun Jiang; Guo-Sen Xie; Cuntai Guan

arXiv:2602.14425 · cs.CV · February 17, 2026

TL;DR

This paper introduces HiVA, a hierarchical vision-language model that leverages textual AU descriptions and advanced cross-modal attention to improve facial Action Unit detection, especially under limited data conditions.

Contribution

The paper proposes a novel hierarchical cross-modal framework with AU-aware graph modules and dual attention mechanisms, enhancing AU detection by integrating rich language priors and visual features.

Findings

01. HiVA outperforms existing state-of-the-art AU detection methods.

02. The model produces semantically meaningful activation patterns.

03. It demonstrates robustness and interpretability in cross-modal AU analysis.

Abstract

Facial Action Unit (AU) detection seeks to recognize subtle facial muscle activations as defined by the Facial Action Coding System (FACS). A primary challenge in AU detection is learning discriminative and generalizable AU representations when annotated data is limited. To address this, we propose a Hierarchical Vision-language Interaction for AU Understanding (HiVA) method, which leverages textual AU descriptions as semantic priors to guide and enhance AU detection. Specifically, HiVA employs a large language model to generate diverse and contextually rich AU descriptions to strengthen language-based representation learning. To capture both fine-grained and holistic vision-language associations, HiVA introduces an AU-aware dynamic graph module that facilitates the learning of AU-specific visual representations. These features are further integrated…
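
The idea of letting textual AU descriptions guide visual AU features can be illustrated with generic scaled dot-product cross-attention. The sketch below is not the authors' implementation: the function names (`softmax`, `cross_attention`), the toy vectors, and the omission of learned projection matrices are all assumptions made for illustration. Each per-AU visual feature acts as a query over a set of text-description embeddings, which serve as both keys and values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual_feats, text_feats):
    """Generic cross-attention (hypothetical helper, no learned projections).

    visual_feats: list of per-AU visual vectors, used as queries.
    text_feats:   list of AU-description embeddings, used as keys and values.
    Returns the text-informed feature for each query and the attention weights.
    """
    dim = len(text_feats[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for q in visual_feats:
        # Similarity between this visual query and every text embedding.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_feats]
        w = softmax(scores)
        # Weighted sum of the text embeddings (the values).
        out = [sum(wj * k[d] for wj, k in zip(w, text_feats)) for d in range(dim)]
        fused.append(out)
        weights.append(w)
    return fused, weights

# Toy inputs (assumed): two AU visual features and two description embeddings.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
fused, weights = cross_attention(visual, text)
```

In this toy setup each visual feature places most of its attention on the description it is aligned with. A trained model would typically add learned query, key, and value projections and multiple heads.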

Taxonomy

Topics: Emotion and Mood Recognition · Face recognition and analysis · Face Recognition and Perception