KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features
Ivan Kukanov, Jun Wah Ng

TL;DR
This paper introduces a multimodal deepfake detection method that combines handcrafted visual features with SSL-based audio representations, reporting 92.78% AUC on deepfake classification and 0.3536 IoU on temporal localization while targeting robustness to unseen deepfake attacks.
Contribution
It proposes a multimodal approach that uses handcrafted visual features and SSL-based audio models to improve deepfake detection and localization.
Findings
Achieved 92.78% AUC on deepfake classification
Attained 0.3536 IoU for temporal localization using audio
Balances detection performance with real-world applicability
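For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how AUC and temporal IoU are commonly computed. It is not the challenge's official scoring code, and the exact AV-Deepfake1M 2025 protocol may differ (for example in how multiple segments per video are aggregated). The function names `pairwise_auc` and `interval_iou`, the (start, end) segment format in seconds, and the sample values are all assumptions made for illustration.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (fake, real) pairs ranked correctly.

    scores: higher means "more likely fake"; labels: 1 = fake, 0 = real.
    Ties between a fake and a real score count as half a correct pair.
    """
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    if not fakes or not reals:
        raise ValueError("AUC needs at least one fake and one real sample")
    correct = 0.0
    for f in fakes:
        for r in reals:
            if f > r:
                correct += 1.0
            elif f == r:
                correct += 0.5
    return correct / (len(fakes) * len(reals))


def interval_iou(pred, truth):
    """Intersection over union of two time segments given as (start, end)."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    print(pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
    print(interval_iou((1.0, 3.0), (2.0, 4.0)))
```

Under this definition, an IoU of 1.0 means the predicted manipulated segment matches the ground truth exactly, and 0.0 means the two segments do not overlap at all.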
Abstract
The rapid development of audio-driven talking head generators and advanced Text-To-Speech (TTS) models has led to more sophisticated temporal deepfakes. These advances highlight the need for robust methods capable of detecting and localizing deepfakes, even under novel, unseen attack scenarios. Current state-of-the-art deepfake detectors, while accurate, are often computationally expensive and struggle to generalize to novel manipulation techniques. To address these challenges, we propose multimodal approaches for the AV-Deepfake1M 2025 challenge. For the visual modality, we leverage handcrafted features to improve interpretability and adaptability. For the audio modality, we adapt a self-supervised learning (SSL) backbone coupled with graph attention networks to capture rich audio representations, improving detection robustness. Our approach strikes a balance between performance and…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
