MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition
Peihao Xiang, Kaida Wu, Ou Bai

TL;DR
This paper introduces MTCAE-DFER, a multi-task cascaded autoencoder framework utilizing Vision Transformers to improve dynamic facial expression recognition by integrating local and global dynamic features and reducing overfitting.
Contribution
It proposes a novel plug-and-play cascaded decoder, based on the ViT architecture, for multi-task learning in dynamic facial expression recognition, enhancing feature interaction and model robustness.
Findings
Outperforms state-of-the-art methods on public datasets.
Demonstrates improved generalization through multi-task learning.
Shows robustness and effectiveness of global-local feature interaction.
Abstract
This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition, namely Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of the Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the Video Masked Autoencoder (VideoMAE) shared encoder output acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, this proposal aims to alleviate overfitting in complex large models. We…
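The query/key/value arrangement described in the abstract can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention. This is not the authors' implementation: it leaves out the learned projection matrices, multiple heads, residual connections, and normalization that a ViT-style decoder block would normally include, and the names `cross_attention`, `prev_task_tokens`, and `encoder_tokens` are hypothetical, chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: list of n_q vectors of dimension d   (previous task's decoder output)
    keys:    list of n_kv vectors of dimension d  (shared encoder output)
    values:  list of n_kv vectors of dimension d_v (shared encoder output)
    Returns a list of n_q vectors of dimension d_v.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query token to every encoder token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the encoder tokens (the "global" features).
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(value_dim)]
        )
    return outputs


# Toy tokens, assumed for illustration: 3 "local" tokens from a previous
# task's decoder and 4 "global" tokens from the shared encoder.
prev_task_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]

# Q comes from the previous task; K and V both come from the encoder.
fused = cross_attention(prev_task_tokens, encoder_tokens, encoder_tokens)
```

Each output token is a convex combination of the encoder tokens, weighted by how well they match the corresponding query, which is one way to read the "interaction between global and local dynamic features" mentioned above.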
Taxonomy
Topics: Emotion and Mood Recognition · Face and Expression Recognition
Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Dense Connections · Residual Connection · Vision Transformer · Multi-Head Attention
