CAST: Cross-Attentive Spatio-Temporal feature fusion for deepfake detection
Aryan Thakre, Omkar Nagwekar, Vedang Talekar, Aparna Santra Biswas

TL;DR
The paper introduces CAST, a deepfake detection model that uses cross-attention to fuse spatial and temporal features, improving detection accuracy and cross-dataset robustness.
Contribution
CAST is the first model to use cross-attention for integrated spatio-temporal feature fusion in deepfake detection, enabling more precise and context-aware identification of manipulations.
Findings
Achieves 99.49% AUC and 97.57% accuracy in intra-dataset tests.
Outperforms existing methods in cross-dataset evaluations with high AUC scores.
Effectively detects subtle, time-dependent artifacts in deepfake videos.
Abstract
Deepfakes have emerged as a significant threat to digital media authenticity, increasing the need for advanced detection techniques that can identify subtle and time-dependent manipulations. CNNs are effective at capturing spatial artifacts, and Transformers excel at modeling temporal inconsistencies. However, many existing CNN-Transformer models process spatial and temporal features independently. In particular, attention-based methods often apply separate attention mechanisms to spatial and temporal features and combine the results using naive approaches such as averaging, addition, or concatenation, which limits the depth of spatio-temporal interaction. To address this challenge, we propose a unified CAST model that leverages cross-attention to fuse spatial and temporal features in a more integrated manner. Our approach allows temporal features to dynamically attend to relevant spatial…
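The fusion idea described in the abstract, temporal features attending to spatial features, can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the function name `cross_attention`, the shapes (8 temporal tokens, 49 spatial tokens, 32 dimensions), and the random projection matrices are all assumptions chosen for illustration. Queries come from the temporal stream; keys and values come from the spatial stream.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(temporal, spatial, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention.

    temporal: (T, d) array, one row per temporal token (source of queries).
    spatial:  (S, d) array, one row per spatial token (source of keys/values).
    Returns the fused features (T, d_out) and the attention weights (T, S).
    """
    q = temporal @ w_q                      # (T, d_out) queries from temporal stream
    k = spatial @ w_k                       # (S, d_out) keys from spatial stream
    v = spatial @ w_v                       # (S, d_out) values from spatial stream
    scores = q @ k.T / np.sqrt(q.shape[-1]) # scaled dot-product scores, (T, S)
    weights = softmax(scores, axis=-1)      # each temporal token's weights sum to 1
    return weights @ v, weights

# Illustrative shapes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
T, S, d = 8, 49, 32
temporal = rng.standard_normal((T, d))
spatial = rng.standard_normal((S, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(temporal, spatial, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Unlike averaging, addition, or concatenation, each fused temporal token here is a data-dependent weighted mix of spatial tokens, so the weighting can differ per frame and per input.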
