TAG: Thinking with Action Unit Grounding for Facial Expression Recognition
Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang

TL;DR
This paper introduces TAG, a vision-language framework for facial expression recognition that grounds its reasoning in facial Action Units (AUs). The grounding is intended to make predictions more accurate and verifiable, reduce hallucination, and improve robustness across datasets.
Contribution
The paper proposes a novel AU-grounded reasoning framework for FER that combines supervised fine-tuning and reinforcement learning to produce verifiable and robust predictions.
Findings
Outperforms existing VLM baselines on multiple datasets.
Improves visual faithfulness and reduces hallucination.
AU-grounded rewards stabilize reasoning processes.
Abstract
Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement…
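The abstract does not spell out how a reasoning step is checked against an AU-related facial region. One plausible way to picture such a check is an overlap score between the region a reasoning step points to and a reference region for the AU it names. The sketch below is an illustration under that assumption, not the authors' reward: the box format, the use of intersection-over-union, and the names `iou` and `au_grounding_reward` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def au_grounding_reward(
    steps: List[Tuple[str, Box]],
    reference_regions: Dict[str, Box],
) -> float:
    """Mean overlap between each step's box and the reference box of its AU.

    Assumption: each reasoning step is a pair (AU label, predicted box).
    A step that names an AU with no reference region scores 0.
    """
    if not steps:
        return 0.0
    scores = []
    for au_label, predicted_box in steps:
        reference_box = reference_regions.get(au_label)
        if reference_box is None:
            scores.append(0.0)  # AU named without supporting evidence
        else:
            scores.append(iou(predicted_box, reference_box))
    return sum(scores) / len(scores)


# Usage: one well-grounded step and one step naming an unsupported AU.
reference = {"AU12": (0.0, 0.0, 10.0, 10.0)}
trace = [("AU12", (0.0, 0.0, 10.0, 10.0)), ("AU4", (0.0, 0.0, 5.0, 5.0))]
reward = au_grounding_reward(trace, reference)
```

Averaging over steps means a trace that names AUs with no visual support is penalized even when its other steps are well localized, which is one simple way a grounding signal could discourage fluent but unverifiable rationales.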
Taxonomy
Topics: Emotion and Mood Recognition · Face Recognition and Perception · Multimodal Machine Learning Applications
