TL;DR
This paper benchmarks 15 pre-trained vision models for domain-generalizable face anti-spoofing, demonstrating that self-supervised vision transformers, combined with data augmentation techniques, achieve state-of-the-art results efficiently.
Contribution
It provides a systematic evaluation of vision-only foundation models for FAS, establishing a robust and efficient baseline that surpasses existing methods in cross-domain scenarios.
Findings
Self-supervised vision models such as DINOv2 excel at suppressing attention artifacts.
Combined with data augmentation, the vision-only baseline achieves state-of-the-art performance.
The approach outperforms existing methods under data-constrained protocols.
Abstract
Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers,…
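As a rough illustration of the cross-domain evaluation setting the abstract describes, the sketch below enumerates train/test domain splits. It is not the paper's code. It assumes that "MICO" refers to the four datasets commonly abbreviated M, I, C and O in the FAS literature, evaluated leave-one-domain-out, and that the Limited Source Domains protocol trains on a small fixed subset of domains (here M and I, chosen for illustration) and tests on each remaining one. The function names are hypothetical.

```python
# Assumption: dataset names behind the "MICO" abbreviation, as commonly used
# in the face anti-spoofing literature. The summary above does not spell them out.
DOMAINS = {
    "M": "MSU-MFSD",
    "I": "Idiap Replay-Attack",
    "C": "CASIA-FASD",
    "O": "OULU-NPU",
}


def leave_one_out_splits(domains):
    """Yield (source_keys, target_key): train on all domains but one, test on the held-out one."""
    keys = list(domains)
    for target in keys:
        sources = [k for k in keys if k != target]
        yield sources, target


def limited_source_splits(domains, sources):
    """Yield (source_keys, target_key): train on a fixed small source set, test on each other domain."""
    for target in domains:
        if target not in sources:
            yield list(sources), target


if __name__ == "__main__":
    for sources, target in leave_one_out_splits(DOMAINS):
        print("train on", "&".join(sources), "-> test on", target)
    # Hypothetical choice of source domains for the limited-source setting.
    for sources, target in limited_source_splits(DOMAINS, ["M", "I"]):
        print("train on", "&".join(sources), "-> test on", target)
```

In both settings the target domain is never seen during training, which is what makes the evaluation a test of domain generalization and not of in-domain accuracy.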
