FLIP: Cross-domain Face Anti-spoofing with Language Guidance
Koushik Srivatsan, Muzammal Naseer, Karthik Nandakumar

TL;DR
This paper introduces FLIP, a cross-domain face anti-spoofing method that leverages multimodal pre-trained vision-language models and natural language grounding to improve generalization and zero-shot transfer capabilities.
Contribution
The work demonstrates that initializing ViTs with multimodal pre-trained weights and aligning visual features with natural language descriptions enhances FAS generalization, introducing a novel multimodal contrastive learning strategy.
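To make the idea of aligning visual features with natural language descriptions concrete, here is a minimal sketch of language-guided classification. It is not the authors' implementation. It assumes an image encoder and a text encoder have already produced embeddings in a shared space. The toy 3-dimensional vectors, the function names (cosine, mean_vector, classify), and the temperature value of 0.07 are all hypothetical choices made for illustration. Each class ("real" or "spoof") is represented by the average of several prompt embeddings, and the image is assigned to the class whose text representation is most similar.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(image_emb, class_prompt_embs, temperature=0.07):
    """Softmax over image-to-text cosine similarities.

    class_prompt_embs maps a class name to a list of prompt embeddings;
    each class is summarized by the mean of its prompt embeddings.
    The temperature is an assumed value, not one taken from the paper.
    """
    names = list(class_prompt_embs)
    logits = [
        cosine(image_emb, mean_vector(class_prompt_embs[name])) / temperature
        for name in names
    ]
    # Subtract the max logit for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}

# Toy stand-ins for encoder outputs (hypothetical values).
prompt_embs = {
    "real": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "spoof": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]],
}
image_emb = [0.7, 0.2, 0.1]

probs = classify(image_emb, prompt_embs)
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

In a real system the embeddings would come from a vision-language model such as CLIP, and the prompts would be natural language descriptions of live and spoofed faces. Averaging several prompts per class is one simple way to reduce sensitivity to the wording of any single description.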
Findings
Outperforms state-of-the-art methods on standard cross-domain FAS protocols.
Achieves superior zero-shot transfer performance.
Improves robustness in low-data regimes.
Abstract
Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust…
Taxonomy
Topics: Biometric Identification and Security · Face recognition and analysis · Reconstructive Facial Surgery Techniques
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Residual Connection · Dense Connections · Layer Normalization · Vision Transformer · Contrastive Learning
