TL;DR
InstructFLIP is an instruction-tuned vision-language framework for face anti-spoofing. It improves cross-domain generalization through textual guidance, decouples instructions into content and style components, and reduces training redundancy across domains.
Contribution
The paper introduces InstructFLIP, a unified vision-language model that enhances face anti-spoofing by integrating textual instructions and a meta-domain strategy for better generalization.
Findings
Outperforms state-of-the-art models in accuracy.
Reduces training redundancy across multiple domains.
Effectively decouples content and style instructions.
Abstract
Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based…
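To make the content/style split concrete, here is a minimal sketch of how a pair of decoupled instructions might be assembled for one face image. It is an illustration only, not the authors' implementation: the names `InstructionPair` and `build_instructions`, the prompt wording, and the option labels are all assumptions. The style prompt in particular is a placeholder, because the abstract excerpt above is cut off before it describes what style-based instructions cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionPair:
    """Hypothetical container for the two decoupled instructions."""
    content: str  # asks about spoofing semantics (wording is an assumption)
    style: str    # placeholder; the source excerpt does not define style prompts


def build_instructions(content_options, style_question):
    """Build one content instruction and one style instruction.

    The content instruction lists the candidate classes as numbered options.
    The style instruction is passed through unchanged, so the caller decides
    what "style" means in their setting.
    """
    options = ", ".join(
        f"({i + 1}) {name}" for i, name in enumerate(content_options)
    )
    content = (
        f"Which type of input is shown in this face image? Options: {options}."
    )
    return InstructionPair(content=content, style=style_question)


# Example usage with illustrative labels and a placeholder style prompt.
pair = build_instructions(
    ["live face", "spoof attack"],
    "Describe the style attributes of this image.",
)
print(pair.content)
print(pair.style)
```

Keeping the two prompts in separate fields mirrors the decoupling idea: each instruction can be sent to the vision-language model and supervised on its own, so the content answer is not entangled with the style answer.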
