Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition
Muzammil Behzad

TL;DR
This paper introduces FACET-VLM, a vision-language framework that combines multiview facial representation learning with semantic guidance from natural language prompts to improve 3D/4D facial expression recognition. It reports state-of-the-art accuracy on benchmarks including BU-3DFE, Bosphorus, and BU-4DFE.
Contribution
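The paper proposes a multiview fusion framework for 3D/4D facial expression recognition built from three components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantic alignment with text prompts, and a multiview consistency loss that enforces structural coherence across views.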
Findings
Achieves state-of-the-art accuracy on multiple benchmarks.
Effectively captures subtle micro-expressions in 4D data.
Demonstrates robustness across posed and spontaneous expressions.
Abstract
Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET-VLM, a vision-language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET-VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotions, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and…
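To make the ideas of text-guided fusion and a multiview consistency loss concrete, here is a minimal NumPy sketch. It is not the authors' formulation: the abstract does not give equations, so the function names (`text_guided_fusion`, `multiview_consistency_loss`), the softmax weighting by cosine similarity, the `temperature` parameter, and the mean-squared-distance loss are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def multiview_consistency_loss(view_feats):
    """Hypothetical consistency loss (an assumption, not the paper's).

    view_feats: array of shape (V, D), one embedding per view.
    Returns the mean squared distance of each normalized view
    embedding from the mean of all normalized view embeddings,
    which is zero when every view points in the same direction.
    """
    z = l2_normalize(view_feats)
    center = z.mean(axis=0, keepdims=True)
    return float(np.mean(np.sum((z - center) ** 2, axis=1)))


def text_guided_fusion(view_feats, text_feat, temperature=0.1):
    """Hypothetical text-guided fusion (an assumption, not the paper's).

    Weights each view by the softmax of its cosine similarity to a
    text-prompt embedding, then returns the weighted sum of the
    original (unnormalized) view embeddings and the weights.
    """
    z = l2_normalize(view_feats)
    t = l2_normalize(text_feat)
    logits = (z @ t) / temperature      # shape (V,)
    logits = logits - logits.max()      # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    fused = weights @ view_feats        # shape (D,)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(3, 8))     # 3 views, 8-dim toy embeddings
    prompt = rng.normal(size=8)         # stand-in for a text embedding
    fused, weights = text_guided_fusion(views, prompt)
    print("view weights:", weights)
    print("consistency loss:", multiview_consistency_loss(views))
```

In a real system the view and text embeddings would come from trained vision and language encoders; random vectors are used here only so the sketch runs on its own.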
Taxonomy
Topics: Emotion and Mood Recognition · Face recognition and analysis · Face Recognition and Perception
