A Multi-modal and Multi-task Learning Method for Action Unit and Expression Recognition
Yue Jin, Tianqing Zheng, Chao Gao, Guoqiang Xu

TL;DR
This paper presents a multi-modal, multi-task learning approach that uses visual and audio data for in-the-wild human affect analysis, with the aim of improving both action unit (AU) and expression recognition.
Contribution
The paper introduces a multi-modal, multi-task framework that combines visual and audio cues with a sequence model over video frames for affect recognition in unconstrained environments.
Findings
Achieved an AU score of 0.712 on the ABAW 2021 validation set
Achieved an expression score of 0.477 on the ABAW 2021 validation set
The authors report that these results demonstrate the effectiveness of the approach for in-the-wild affect analysis
Abstract
Analyzing human affect is vital for human-computer interaction systems. Most methods are developed in restricted scenarios which are not practical for in-the-wild settings. The Affective Behavior Analysis in-the-wild (ABAW) 2021 Contest provides a benchmark for this in-the-wild problem. In this paper, we introduce a multi-modal and multi-task learning method by using both visual and audio information. We use both AU and expression annotations to train the model and apply a sequence model to further extract associations between video frames. We achieve an AU score of 0.712 and an expression score of 0.477 on the validation set. These results demonstrate the effectiveness of our approach in improving model performance.
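The abstract does not specify how the modalities are fused or how the two tasks are weighted, so the following is only a minimal sketch of one common way to set up such training, not the authors' implementation. It assumes concatenation of per-frame visual and audio features, a binary cross-entropy term for the multi-label AU task, a softmax cross-entropy term for the single-label expression task, and a weighted sum that skips a task when a frame lacks that annotation. All function names, the weights, and the toy dimensions are hypothetical.

```python
import math


def fuse_features(visual, audio):
    """Assumed fusion: concatenate per-frame visual and audio feature vectors."""
    return list(visual) + list(audio)


def au_loss(logits, labels):
    """Mean binary cross-entropy over AUs (multi-label), in a numerically stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def expr_loss(logits, label):
    """Softmax cross-entropy for a single expression class index."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multi_task_loss(au_logits, au_labels, expr_logits, expr_label,
                    au_weight=1.0, expr_weight=1.0):
    """Weighted sum of both task losses; a task with a missing label (None) is skipped."""
    loss = 0.0
    if au_labels is not None:
        loss += au_weight * au_loss(au_logits, au_labels)
    if expr_label is not None:
        loss += expr_weight * expr_loss(expr_logits, expr_label)
    return loss


# Toy usage: 2 AUs and 3 expression classes are placeholder sizes.
fused = fuse_features([0.1, 0.2, 0.3], [0.4, 0.5])
loss_both = multi_task_loss([0.0, 0.0], [1, 0], [0.0, 0.0, 0.0], 1)
loss_au_only = multi_task_loss([0.0, 0.0], [1, 0], [0.0, 0.0, 0.0], None)
```

In a full pipeline the fused features would first pass through a shared backbone and a sequence model before reaching the two task heads; the sketch covers only the fusion and loss bookkeeping.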
Taxonomy
Topics: Emotion and Mood Recognition · Human Pose and Action Recognition · Social Robot Interaction and HRI
