Denoising and Alignment: Rethinking Domain Generalization for Multimodal Face Anti-Spoofing
Yingjie Ma, Xun Lin, Zitong Yu, Xin Liu, Xiaochen Yuan, Weicheng Xie, Linlin Shen

TL;DR
This paper introduces the MMDA framework that leverages denoising, alignment, and pre-trained models to improve cross-domain generalization in multimodal face anti-spoofing, outperforming existing methods.
Contribution
The paper proposes a novel MMDA framework combining denoising, alignment, and U-DSA modules to enhance multimodal FAS generalization using CLIP's zero-shot capabilities.
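As background for the "zero-shot capabilities" mentioned above, here is a minimal sketch of how CLIP-style zero-shot scoring is usually described: an image embedding is compared with one text embedding per class prompt by cosine similarity, and the scaled similarities are passed through a softmax. This is a generic illustration, not the MMDA pipeline. The embeddings are hand-written stand-ins for real encoder outputs, and the prompts, the function names, and the `logit_scale` value of 100 are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities over the class prompts for one image embedding.

    image_emb: list of floats (stand-in for an image encoder output).
    text_embs: list of float lists, one per class prompt (stand-ins for text encoder outputs).
    logit_scale: assumed temperature multiplier applied to the cosine similarities.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical prompts; the toy vectors below stand in for their text embeddings.
    prompts = ["a photo of a real face", "a photo of a spoof face"]
    text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    image_emb = [0.9, 0.1, 0.0]
    for prompt, p in zip(prompts, zero_shot_probs(image_emb, text_embs)):
        print(prompt, round(p, 4))
```

No training on the target domain is involved in this scoring step, which is why it is called zero-shot: the class is chosen purely by which prompt embedding lies closest to the image embedding.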
Findings
Outperforms state-of-the-art in cross-domain tests
Enhances multimodal detection accuracy
Improves representation robustness
Abstract
Face Anti-Spoofing (FAS) is essential for the security of facial recognition systems in diverse scenarios such as payment processing and surveillance. Current multimodal FAS methods often struggle with effective generalization, mainly due to modality-specific biases and domain shifts. To address these challenges, we introduce the Multimodal Denoising and Alignment (MMDA) framework. By leveraging the zero-shot generalization capability of CLIP, the MMDA framework effectively suppresses noise in multimodal data through denoising and alignment mechanisms, thereby significantly enhancing the generalization performance of cross-modal alignment. The Modality-Domain Joint Differential Attention (MD2A) module in MMDA concurrently mitigates the impacts of domain and modality noise by refining the attention…
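The abstract is cut off before it specifies how MD2A refines attention, so the following is only a sketch of generic differential attention in the style of the Differential Transformer: two softmax attention maps are computed and the second, scaled by a factor, is subtracted from the first so that attention mass shared by both maps cancels. Whether MD2A uses this exact form is an assumption. The function names, the fixed `lam` value, and the toy inputs are hypothetical, and plain Python lists are used only to keep the example self-contained.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one softmax row per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Compute (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V.

    Returns the output rows and the differential weight matrix.
    lam is a fixed scalar here; treating it as fixed is an assumption of this sketch.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    # Subtract the second map so attention common to both maps is suppressed.
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    dim_v = len(values[0])
    out = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim_v)]
        for row in diff
    ]
    return out, diff


if __name__ == "__main__":
    # Toy inputs: two tokens with two-dimensional queries, keys, and values.
    q1 = [[1.0, 0.0], [0.0, 1.0]]
    k1 = [[1.0, 0.0], [0.0, 1.0]]
    q2 = [[0.5, 0.5], [0.5, 0.5]]
    k2 = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0]]
    out, weights = differential_attention(q1, k1, q2, k2, values, lam=0.5)
    print(out)
    print(weights)
```

Because each softmax row sums to 1, every row of the differential weight matrix sums to `1 - lam`, and individual weights can be negative. When both maps are identical and `lam` is 1, the output is exactly zero, which is the sense in which a shared component is cancelled.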
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- The paper applies several techniques to align the pre-trained space on the source datasets and to make the modality representations more compact.
- All modules in the paper are well motivated and explained, with acceptable illustrations.
- The proposed MMDA method shows superior performance across the evaluation protocols: unseen-domain testing, limited source domains for training, and missing-modality scenarios.
- The contribution of the U-shaped Dual Space Adaptation Module (U-DSA) is not well-studied. Specifically, the ablation study in Table 11 indicates that performance degrades whenever U-DSA is included (for example, comparing line 1 vs. line 3, and line 2 vs. line 4). The ablation study in Figure 3 solely shows the effect of the number of layers on output performance, not a comparison against a simpler alternative, such as using only MLP layers (Adapt). Meanwhile, the dilemma of deep versus shall
- Proposes a conceptually unified pipeline combining denoising, alignment, and adaptation to address an important and practical problem: robust multimodal FAS under domain shift.
- The MD2A module introduces a differential attention mechanism to deal with modality and domain biases.
- The RS2 module's flexible text-subspace alignment is novel and intuitively appealing for CLIP-based multimodal tasks.
- Comprehensive experiments on multiple datasets and settings.
- Questionable theoretical justification of MD2A: The claim that same-domain sample pairs isolate “noise” is not empirically proven; cross-domain comparisons or explicit visualization of noise components are missing. The denoising term appears more heuristic than theoretically motivated.
- Insufficient clarity and ablation for U-DSA: The U-shaped Dual Space Adaptation is conceptually interesting but underexplained. The mechanism of “Remap” and its interaction with layer-wise RS2 losses are no
The framework is technically well-structured and empirically validated across multiple benchmarks (CeFA, PADISI, SURF, WMCA). The ablation studies are detailed, covering all three proposed modules. Visualization results (t-SNE) and efficiency analysis (Table 9) help make the method transparent and reproducible. The integration of CLIP-based alignment into multimodal FAS is timely and relevant.
1. Motivation is not empirically substantiated (two parts). (a) Figure 1 claims that modality bias makes the IR–Depth gap significantly larger than RGB–RGB, yet the paper provides no dedicated analysis (e.g., inter-modality feature distances, per-modality performance gaps, or distribution visualizations) to support this hypothesis. (b) The method is argued to “avoid overly smooth decision boundaries”, but the only evidence is a t-SNE plot; there is no direct boundary/margin analysis or causal li
* The paper presents a comprehensive experimental evaluation that validates the effectiveness of MMDA under both protocols. The ablation studies for each component are also meticulous. Furthermore, the manuscript is well-structured and formatted.
* The motivation of this paper seems limited. In line 72, the authors use Figure 1 to explain the two difficulties in the multimodal DG problem of FAS: noise diversity and difficulty in alignment. However, these two difficulties can actually be unified into the modality gap and the domain gap, which have been discussed many times in previous papers. Moreover, Figure 1 does not intuitively reflect the impact of the modality and domain gaps on existing methods.
* The motivations and methodologies underly
Taxonomy
Topics: Biometric Identification and Security · Reconstructive Facial Surgery Techniques
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · ALIGN
