ViTransPAD: Video Transformer using convolution and self-attention for Face Presentation Attack Detection
Zuheng Ming, Zitong Yu, Musab Al-Ghadi, Muriel Visani, Muhammad Muzzamil Luqman, Jean-Christophe Burie

TL;DR
ViTransPAD introduces a novel video transformer architecture with multi-scale attention and convolution integration for improved face presentation attack detection, capturing both local details and long-range temporal dependencies.
Contribution
The paper proposes ViTransPAD, a new video transformer model with multi-scale self-attention and convolutional components, enhancing face PAD by learning fine-grained pixel-level discrimination.
Findings
Achieves superior accuracy in face PAD tasks.
Balances computational efficiency with detection performance.
Outperforms existing CNN and transformer-based methods.
Abstract
Face Presentation Attack Detection (PAD) is an important measure to prevent spoof attacks against face biometric systems. Many works based on Convolutional Neural Networks (CNNs) for face PAD formulate the problem as an image-level binary classification task without considering the context. Alternatively, Vision Transformers (ViT), which use self-attention to attend to the context of an image, have become mainstream in face PAD. Inspired by ViT, we propose a Video-based Transformer for face PAD (ViTransPAD) with short/long-range spatio-temporal attention, which can not only focus on local details with short-range attention within a frame but also capture long-range dependencies across frames. Instead of using coarse, single-scale image patches as in ViT, we propose the Multi-scale Multi-Head Self-Attention (MsMHSA) architecture to accommodate multi-scale patch partitions of Q, K, V feature maps to the…
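The idea of giving each attention head its own patch scale over Q, K, V feature maps can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated here, so the even channel split across heads, the patch sizes, the omission of the convolutional components, and all function names (`partition_patches`, `merge_patches`, `multi_scale_attention`) are assumptions made purely for illustration. Tokens are drawn from every frame of the clip, so each head attends both within a frame and across frames.

```python
import numpy as np

def partition_patches(x, p):
    """Split a (T, H, W, C) feature map into non-overlapping p x p patches.
    Returns one token per patch: shape (T * H/p * W/p, p * p * C)."""
    T, H, W, C = x.shape
    assert H % p == 0 and W % p == 0, "patch size must divide H and W"
    x = x.reshape(T, H // p, p, W // p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, H/p, W/p, p, p, C)
    return x.reshape(T * (H // p) * (W // p), p * p * C)

def merge_patches(tokens, T, H, W, C, p):
    """Inverse of partition_patches: rebuild the (T, H, W, C) feature map."""
    x = tokens.reshape(T, H // p, W // p, p, p, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (T, H/p, p, W/p, p, C)
    return x.reshape(T, H, W, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_scale_attention(q, k, v, patch_sizes):
    """Hypothetical multi-scale attention: channels are split evenly across
    heads, and head i partitions its Q, K, V slices with patch_sizes[i]."""
    T, H, W, C = q.shape
    n_heads = len(patch_sizes)
    assert C % n_heads == 0, "channels must split evenly across heads"
    c = C // n_heads
    outputs = []
    for i, p in enumerate(patch_sizes):
        ch = slice(i * c, (i + 1) * c)
        qh = partition_patches(q[..., ch], p)
        kh = partition_patches(k[..., ch], p)
        vh = partition_patches(v[..., ch], p)
        # Scaled dot-product attention over all patches of all frames.
        attn = softmax(qh @ kh.T / np.sqrt(qh.shape[-1]))
        outputs.append(merge_patches(attn @ vh, T, H, W, c, p))
    return np.concatenate(outputs, axis=-1)      # back to (T, H, W, C)

# Toy clip: 2 frames of 8 x 8 feature maps with 6 channels, 3 heads.
rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 8, 8, 6))
out = multi_scale_attention(feats, feats, feats, patch_sizes=[1, 2, 4])
print(out.shape)  # (2, 8, 8, 6)
```

In this sketch the head with patch size 1 attends at pixel level, while the heads with larger patches attend over coarser regions; a learned model would additionally apply projections to produce Q, K, and V.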
Taxonomy
Topics: Biometric Identification and Security · Face recognition and analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Byte Pair Encoding · Softmax · Position-Wise Feed-Forward Layer · Residual Connection · Layer Normalization
