TL;DR
This paper introduces a new evaluation framework for DeepFake detection that emphasizes semantic inconsistencies between audio and video, revealing limitations of current models and proposing enhancements for more realistic detection.
Contribution
It extends existing four-class formulations by modeling semantic mismatches, introduces variants exposing architectural vulnerabilities, and proposes a semantic reinforcement strategy with ImageBind embeddings.
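As a rough illustration of the extended label space, the sketch below adds a fifth label to a four-class audio-visual scheme and builds semantic-mismatch samples by pairing authentic audio with authentic video from a different clip. This is a minimal sketch under stated assumptions, not the paper's data pipeline: the class abbreviations other than RARV-SMM, the clip identifiers, and the helper `make_smm_pairs` are hypothetical names chosen for illustration.

```python
import random
from enum import IntEnum


class AVLabel(IntEnum):
    """Five-way label space; only RARV_SMM is named in the source,
    the other abbreviations are assumed for illustration."""
    RARV = 0      # real audio, real video
    RAFV = 1      # real audio, fake video
    FARV = 2      # fake audio, real video
    FAFV = 3      # fake audio, fake video
    RARV_SMM = 4  # real audio, real video, semantically mismatched


def make_smm_pairs(real_clips, seed=0):
    """Pair each real video with real audio taken from a different clip.

    real_clips: list of (audio_id, video_id) tuples from authentic clips.
    Returns a list of (audio_id, video_id, AVLabel.RARV_SMM) tuples.
    A cyclic shift by a non-zero offset guarantees that no video keeps
    its own audio track.
    """
    n = len(real_clips)
    if n < 2:
        raise ValueError("need at least two clips to create a mismatch")
    shift = random.Random(seed).randrange(1, n)
    return [
        (real_clips[(i + shift) % n][0], real_clips[i][1], AVLabel.RARV_SMM)
        for i in range(n)
    ]
```

Both modalities in such a pair are untouched at the signal level, so a detector that only checks source integrity has nothing to latch onto; only the cross-modal content disagrees.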
Findings
State-of-the-art models struggle with semantic mismatch data.
Three RARV-SMM variants reveal architectural vulnerabilities.
Semantic reinforcement improves detection performance.
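The summary does not describe how the semantic reinforcement works, so the following is only a generic sketch of the underlying idea: if audio and video are embedded in a shared space (as a joint model such as ImageBind provides), their cosine similarity can serve as a semantic-consistency signal. The embeddings here are plain lists of floats, no ImageBind API is called, and both the function names and the `threshold` value are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def flag_semantic_mismatch(audio_emb, video_emb, threshold=0.5):
    """Return True when the two embeddings disagree semantically.

    threshold is an arbitrary placeholder; in practice it would be
    tuned on validation data, or the score would be fed to the
    detector as an extra feature instead of being thresholded.
    """
    return cosine_similarity(audio_emb, video_emb) < threshold
```

A hard threshold is the simplest possible use of the score; it is shown here only to make the consistency check concrete.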
Abstract
Current DeepFake detection scenarios are mostly binary, yet manipulation can affect the audio, the video, or both, and binary settings do not capture this variability. Four-class audio-visual formulations address this by discriminating manipulation type, but introduce an unresolved problem: models may rely solely on data source integrity to detect DeepFakes without evaluating their semantic consistency. If the DeepFake originates not in the data source but in its content, can state-of-the-art models assess semantic mismatch? This paper proposes a new evaluation setup that extends the four-class formulation by explicitly modeling semantic-level inconsistency between authentic modalities through the introduction of a new class: Real Audio-Real Video with Semantic Mismatch (RARV-SMM). We assess the robustness of state-of-the-art models in this new realistic DeepFake setting, using the…
