MTCAE-DFER: Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition

Peihao Xiang; Kaida Wu; Ou Bai

arXiv:2412.18988 · cs.CV · July 29, 2025

TL;DR

This paper introduces MTCAE-DFER, a multi-task cascaded autoencoder framework utilizing Vision Transformers to improve dynamic facial expression recognition by integrating local and global dynamic features and reducing overfitting.

Contribution

The paper proposes a plug-and-play cascaded decoder, based on the ViT architecture, for multi-task learning in dynamic facial expression recognition. The decoder is intended to strengthen interaction between features from related tasks and to make the model more robust.

Findings

1. Outperforms state-of-the-art methods on public datasets.
2. Demonstrates improved generalization through multi-task learning.
3. Shows robustness and effectiveness of global-local feature interaction.

Abstract

This paper expands the cascaded network branch of the autoencoder-based multi-task learning (MTL) framework for dynamic facial expression recognition. The resulting model is named Multi-Task Cascaded Autoencoder for Dynamic Facial Expression Recognition (MTCAE-DFER). MTCAE-DFER builds a plug-and-play cascaded decoder module, which is based on the Vision Transformer (ViT) architecture and employs the decoder concept of the Transformer to reconstruct the multi-head attention module. The decoder output from the previous task serves as the query (Q), representing local dynamic features, while the output of the shared Video Masked Autoencoder (VideoMAE) encoder acts as both the key (K) and value (V), representing global dynamic features. This setup facilitates interaction between global and local dynamic features across related tasks. Additionally, the proposal aims to alleviate overfitting in complex large models. We…
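The query/key/value arrangement described in the abstract is a form of cross-attention, and a small example may make it concrete. The following is a minimal single-head sketch in plain Python, not the authors' implementation: the function names, the token counts (4 local tokens, 6 global tokens), the feature width of 8, and the random projection matrices are all assumptions chosen for illustration. It also omits the multiple heads, residual connections, normalization, and feed-forward layers that a full ViT-style decoder block would normally include.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(prev_decoder_out, encoder_out, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative names, not from the paper's code).

    Queries come from the previous task's decoder output (local features);
    keys and values both come from the shared encoder output (global features).
    """
    q = matmul(prev_decoder_out, w_q)
    k = matmul(encoder_out, w_k)
    v = matmul(encoder_out, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d) and normalize each query's scores over the global tokens.
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Hypothetical sizes: 4 local tokens, 6 global tokens, feature width 8.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


local_tokens = rand_matrix(4, 8)    # stands in for the previous decoder output
global_tokens = rand_matrix(6, 8)   # stands in for the shared encoder output
w_q, w_k, w_v = (rand_matrix(8, 8) for _ in range(3))

fused, weights = cross_attention(local_tokens, global_tokens, w_q, w_k, w_v)
print(len(fused), len(fused[0]))  # 4 8
```

Each output row is a weighted mixture of the global tokens, with weights determined by how well a local token matches them. This is one way to read the abstract's "interaction between global and local dynamic features".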

Code & Models

Repositories

Peihao-Xiang/MTCAE-DFER
tf · Official

Taxonomy

Topics: Emotion and Mood Recognition · Face and Expression Recognition

Methods: Attention Is All You Need · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Dense Connections · Residual Connection · Vision Transformer · Multi-Head Attention