A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
Jiajun Sun, Zhe Gao

TL;DR
This paper introduces a two-stage dual-modality model that combines visual and audio features for frame-level facial emotional expression recognition in unconstrained videos, outperforming the official ABAW baselines.
Contribution
The novel two-stage dual-modal approach effectively fuses audio-visual cues and enhances robustness for expression recognition in challenging conditions.
Findings
Achieved a Macro-F1 score of 0.5368 on the validation set.
Outperformed the official baselines on the ABAW dataset.
Demonstrated robustness to pose, scale, and motion variations.
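For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Macro-F1 score is typically computed for an eight-class, frame-level task. The function name `macro_f1` and the convention of scoring a class with no true or predicted frames as 0 are assumptions made for illustration; the challenge's official evaluation script may handle edge cases differently.

```python
def macro_f1(y_true, y_pred, num_classes=8):
    """Unweighted mean of per-class F1 scores.

    y_true and y_pred are equal-length sequences of integer class ids,
    one entry per frame. Classes that never occur in either sequence
    contribute an F1 of 0.0 (an assumed convention, not the official one).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class is absent
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return sum(f1_scores) / num_classes


# Toy three-class example (labels are illustrative, not from the paper)
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(round(score, 4))
```

Because every class is weighted equally regardless of how many frames it has, Macro-F1 penalizes a model that ignores rare expressions, which is why it is a common choice for imbalanced in-the-wild data.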
Abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II…
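The abstract excerpt does not say how the mixture-of-experts head is wired, so the following is only a generic sketch of one common design: a dense, softmax-gated mixture of linear experts applied to a single frame embedding. The class name `MoEHead`, the expert count of 4, the initialization scale, and the dense (rather than top-k sparse) routing are all assumptions for illustration, not the authors' implementation. The toy input dimension is kept small for readability; a DINOv2 ViT-L/14 embedding would be 1024-dimensional.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply y = W x + b, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class MoEHead:
    """Hypothetical dense mixture-of-experts classification head."""

    def __init__(self, in_dim, num_classes=8, num_experts=4, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)]
                    for _ in range(rows)]

        self.num_classes = num_classes
        # Gating network: one score per expert
        self.gate_w = init(num_experts, in_dim)
        self.gate_b = [0.0] * num_experts
        # Each expert is an independent linear classifier
        self.experts = [(init(num_classes, in_dim), [0.0] * num_classes)
                        for _ in range(num_experts)]

    def forward(self, feature):
        """Return (class logits, gate weights) for one feature vector."""
        gate = softmax(linear(self.gate_w, self.gate_b, feature))
        logits = [0.0] * self.num_classes
        for g, (w, b) in zip(gate, self.experts):
            expert_logits = linear(w, b, feature)
            # Weight each expert's logits by its gate probability
            logits = [acc + g * e for acc, e in zip(logits, expert_logits)]
        return logits, gate


head = MoEHead(in_dim=16)
logits, gate = head.forward([0.1] * 16)  # toy 16-dim "embedding"
print(len(logits), len(gate))
```

In a design like this, each expert can specialize on a subset of inputs while the gate decides how much to trust each one, which is one plausible reading of "enhance classifier diversity".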
