TL;DR
This paper introduces two simple, parameter-free attentive pooling modules for Vision Transformers in facial expression recognition. They improve accuracy and efficiency by focusing the model on discriminative features and suppressing noisy ones.
Contribution
The paper proposes novel attentive pooling modules (APP and ATP) that enhance Vision Transformer performance in FER without additional learnable parameters.
Findings
Outperforms state-of-the-art methods on six in-the-wild datasets.
Reduces computational cost while sharpening the focus on discriminative features.
Demonstrates effectiveness through qualitative and quantitative analysis.
Abstract
Facial Expression Recognition (FER) in the wild is an extremely challenging task. Recently, some Vision Transformers (ViT) have been explored for FER, but most of them perform worse than Convolutional Neural Networks (CNN). This is mainly because the newly proposed modules are difficult to train to convergence from scratch, owing to their lack of inductive bias, and tend to focus on occluded and noisy areas. TransFER, a representative transformer-based method for FER, alleviates this with multi-branch attention dropping but incurs excessive computation. In contrast, we present two attentive pooling (AP) modules that pool noisy features directly. The AP modules include Attentive Patch Pooling (APP) and Attentive Token Pooling (ATP). They aim to guide the model to emphasize the most discriminative features while reducing the impact of less relevant features. The proposed APP is employed…
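The abstract is cut off before it says how APP and ATP score or select features, so the following is only a minimal, dependency-free sketch of the general idea of parameter-free attentive pooling: keep the highest-scoring tokens and discard the rest. The function name `attentive_pool`, the `keep_ratio` argument, and the use of a precomputed score per token are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def attentive_pool(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep_ratio: float = 0.5,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the highest-scoring tokens and drop the rest (no learnable weights).

    tokens: one feature vector per patch/token.
    scores: one importance score per token (hypothetical; for example, the
            attention a class token pays to each token).
    keep_ratio: fraction of tokens to keep (hypothetical hyperparameter).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one token.
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original token order so spatial structure is preserved.
    kept = sorted(top)
    return [list(tokens[i]) for i in kept], kept


# Four toy tokens with 2-D features; the scores mark tokens 1 and 3 as most informative.
tokens = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.1], [0.7, 0.6]]
scores = [0.05, 0.50, 0.05, 0.40]
pooled, kept = attentive_pool(tokens, scores, keep_ratio=0.5)
print(kept)    # [1, 3]
print(pooled)  # [[0.9, 0.8], [0.7, 0.6]]
```

Because the selection is a plain top-k over existing scores, it adds no trainable parameters, and every later layer processes fewer tokens. That is one plausible reading of how such pooling could cut computation while filtering out noisy regions.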
