Face Pyramid Vision Transformer
Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood

TL;DR
The paper introduces Face Pyramid Vision Transformer (FPVT), a new model that combines CNN and ViT features to improve face recognition accuracy while reducing computational costs.
Contribution
FPVT integrates four new modules (FSRA, FDR, IPE, and CFFN) to improve multi-scale facial feature learning and efficiency in face recognition tasks.
Findings
FPVT outperforms ten state-of-the-art methods on seven benchmarks.
FPVT achieves high accuracy with fewer parameters.
FPVT demonstrates robustness across diverse datasets.
Abstract
A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing the computations. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) to model features ranging from lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and Convolutional ViTs. Despite fewer parameters, FPVT…
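To give a rough sense of why compact feature maps reduce computation, the sketch below counts query-key score entries for a single attention head. It assumes that spatial reduction downsamples the keys and values by a fixed ratio along each axis before attention is computed; the abstract does not spell out how FSRA does this, so the function name, the `reduction` parameter, and the 28x28 map size are all hypothetical and chosen only for illustration.

```python
def attention_score_entries(height, width, reduction=1):
    """Count query-key score entries for one attention head.

    Assumption (not confirmed by the abstract): keys and values are
    spatially downsampled by `reduction` along each axis, while the
    queries keep the full height x width resolution.
    """
    if reduction < 1 or height % reduction or width % reduction:
        raise ValueError("reduction must be >= 1 and divide height and width")
    num_queries = height * width
    num_keys = (height // reduction) * (width // reduction)
    return num_queries * num_keys


# Hypothetical 28x28 feature map, with and without a reduction ratio of 4.
full = attention_score_entries(28, 28)
reduced = attention_score_entries(28, 28, reduction=4)
print(full, reduced, full // reduced)  # prints: 614656 38416 16
```

Under this assumption the score matrix shrinks by the square of the reduction ratio, which is the kind of saving a compact feature map would provide; the actual savings in FPVT depend on details not given here.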
Taxonomy
Topics: Face recognition and analysis · Face and Expression Recognition · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Residual Connection · Dropout
