Face Pyramid Vision Transformer

Khawar Islam; Muhammad Zaigham Zaheer; Arif Mahmood

arXiv:2210.11974 · cs.CV · February 24, 2026

TL;DR

The paper introduces Face Pyramid Vision Transformer (FPVT), a new model that combines CNN and ViT features to improve face recognition accuracy while reducing computational costs.

Contribution

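FPVT integrates four modules, Face Spatial Reduction Attention (FSRA), Dimensionality Reduction (FDR), Improved Patch Embedding (IPE), and a Convolutional Feed-Forward Network (CFFN), to improve multi-scale facial feature learning and reduce computation in face recognition.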

Findings

01 FPVT outperforms ten state-of-the-art methods on seven benchmarks.

02 FPVT achieves high accuracy with fewer parameters.

03 FPVT demonstrates robustness across diverse datasets.

Abstract

A novel Face Pyramid Vision Transformer (FPVT) is proposed to learn discriminative multi-scale facial representations for face recognition and verification. In FPVT, Face Spatial Reduction Attention (FSRA) and Dimensionality Reduction (FDR) layers are employed to make the feature maps compact, thus reducing computation. An Improved Patch Embedding (IPE) algorithm is proposed to exploit the benefits of CNNs in ViTs (e.g., shared weights, local context, and receptive fields) and to model features ranging from lower-level edges to higher-level semantic primitives. Within the FPVT framework, a Convolutional Feed-Forward Network (CFFN) is proposed that extracts locality information to learn low-level facial information. The proposed FPVT is evaluated on seven benchmark datasets and compared with ten existing state-of-the-art methods, including CNNs, pure ViTs, and convolutional ViTs. Despite fewer parameters, FPVT…
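
To make the "compact feature maps, thus reducing computation" point concrete, the following is a rough cost sketch, not the authors' implementation. It assumes FSRA behaves like the spatial-reduction attention used in pyramid-style vision transformers, where keys and values are downsampled by a ratio along each spatial axis before attention is computed. The function name `attention_score_macs`, the `reduction` parameter, and the 28×28×64 feature-map size are all hypothetical choices for illustration.

```python
def attention_score_macs(height, width, dim, reduction=1):
    """Count multiply-accumulates for the two matrix products in one
    attention layer (Q @ K^T and attention @ V).

    Hypothetical helper: `reduction` is an assumed spatial reduction ratio
    applied to keys and values along each spatial axis; it is not a
    parameter taken from the paper.
    """
    if reduction < 1 or height % reduction or width % reduction:
        raise ValueError("reduction must be >= 1 and divide height and width")
    num_queries = height * width
    # Keys/values shrink by `reduction` along each axis, so by reduction**2 overall.
    num_keys = (height // reduction) * (width // reduction)
    # Q @ K^T costs num_queries * num_keys * dim; attention @ V costs the same.
    return 2 * num_queries * num_keys * dim


# Assumed example: a 28x28 feature map with 64 channels.
full = attention_score_macs(28, 28, 64)
reduced = attention_score_macs(28, 28, 64, reduction=4)
print(full // reduced)  # 16, i.e. reduction ** 2
```

Under these assumptions the attention products shrink by the square of the reduction ratio, because only the key/value sequence gets shorter while the number of queries stays the same. The sketch ignores the cost of the downsampling step itself and of the linear projections.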

Code & Models

Repositories

khawar-islam/fpvt (PyTorch, official)

Taxonomy

Topics: Face recognition and analysis · Face and Expression Recognition · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Residual Connection · Dropout