Investigating self-supervised representations for audio-visual deepfake detection
Dragos-Alexandru Boldisor, Stefan Smeu, Dan Oneata, Elisabeta Oneata

TL;DR
This paper systematically evaluates self-supervised audio-visual representations for deepfake detection, demonstrating their effectiveness, interpretability, and complementarity across modalities, with audio-informed features achieving state-of-the-art results.
Contribution
It provides a comprehensive analysis of self-supervised features for deepfake detection across modalities and domains, highlighting their potential and limitations.
Findings
Self-supervised features capture deepfake-relevant information.
Features are complementary across modalities.
Audio-informed representations perform best and generalize well.
Abstract
Self-supervised representations excel at many vision and speech tasks, but their potential for audio-visual deepfake detection remains underexplored. Unlike prior work that uses these features in isolation or buried within complex architectures, we systematically evaluate them across modalities (audio, video, multimodal) and domains (lip movements, generic visual content). We assess three key dimensions: detection effectiveness, interpretability of encoded information, and cross-modal complementarity. We find that most self-supervised features capture deepfake-relevant information, and that this information is complementary. Moreover, models primarily attend to semantically meaningful regions rather than spurious artifacts (such as the leading silence). Among the investigated features, audio-informed representations generalize best and achieve state-of-the-art results. However,…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The chosen self-supervised audiovisual representations are appropriate, as are the evaluation datasets. The results presented on the tested datasets seem reasonable, and are accompanied by qualitative analysis of temporal and spatial explanations. - The paper is well written and easy to read and understand. - The reader could benefit from understanding how the chosen self-supervised methods perform on the standard datasets, which could save them from reproducing the same results in their own work.
- The paper uses linear probing on top of pretrained self-supervised audiovisual models to evaluate on multiple downstream datasets. To me, this is akin to running a bunch of baselines on the tested datasets with no particular technical novelty. - A crucial technical drawback of the evaluation setting to me is the fact that individual frames are encoded separately by the audio/visual backbones before being aggregated. While frame-level information does provide useful information, the essence of
- The paper is well written - The study presented is very interesting and novel (to the best of my knowledge) - Results on multiple datasets are presented, both in-domain and out-of-domain - The authors attempt to identify the regions the models attend to, which is very interesting. - Overall, the conclusions are interesting.
- The Auto-AVSR model performs better than AV-HuBERT for visual and audio-visual speech recognition. Is there a hypothesis for why it performs so much worse than AV-HuBERT for deepfake detection? - In Table 2, the performance of the audio and visual Auto-AVSR features is much lower than that of the audio-visual Auto-AVSR features. This looks a bit odd, since the gap for AV-HuBERT is much smaller. - Have the authors tried to combine features from multiple layers instead of using the last layer's features? - One of
- The paper is clearly written and easy to follow. - The flow from problem definition to empirical observation is natural and coherent, allowing readers to easily grasp the motivation and key findings.
- The novelty and contribution are rather limited. The proposed methods are straightforward, and the paper focuses more on analysis than on developing new algorithms. - While readability is good, the paper could be made more compact; for example, information such as Table 1 (basic model descriptions) belongs in the appendix rather than the main text. - The comparisons are not fully appropriate for a deepfake detection study. The paper contrasts self-supervised models (e.g., AV-HuBERT vs. AV-Hu
The paper compares a sufficient spectrum of SSL models across modalities, providing a well-rounded analysis of their suitability for multimodal deepfake detection. The linear-probe setup, log-sum-exp aggregation, and controlled training pipeline enable fair comparisons and isolate the representational power of the backbones. The inclusion of temporal and spatial explanations, as well as quantitative comparison to human click annotations, goes beyond conventional performance reporting (e.g., si
The paper offers limited practical novelty in its insights. The main conclusion that self-supervised representations perform well in-distribution but fail to generalize is not very surprising, as this pattern is well established in the deepfake detection literature (see [1] for a popular earlier work), where most methods struggle under domain shift. It is thus largely expected that self-supervised models trained in a similar way would show the same behavior. While the analysis is systematic, it
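One review above mentions a linear-probe setup with log-sum-exp aggregation. As a rough illustration only, the sketch below scores each frame with a linear layer over frozen features and pools the per-frame scores into one video-level score. The function names, the toy inputs, and the unnormalized form of the pooling are assumptions made for illustration; the paper's exact formulation is not given in this summary.

```python
import math
from typing import List, Sequence


def frame_scores(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Linear probe: map each frozen frame feature vector to one scalar score."""
    return [
        sum(w * x for w, x in zip(weights, frame)) + bias
        for frame in features
    ]


def log_sum_exp_pool(scores: Sequence[float]) -> float:
    """Numerically stable log-sum-exp over per-frame scores.

    Acts as a smooth approximation of max: a few high-scoring frames
    dominate the video-level score. (Assumed unnormalized variant.)
    """
    if not scores:
        raise ValueError("need at least one frame score")
    m = max(scores)  # subtract the max before exponentiating to avoid overflow
    return m + math.log(sum(math.exp(s - m) for s in scores))


def video_score(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Video-level score: linear probe per frame, then log-sum-exp pooling."""
    return log_sum_exp_pool(frame_scores(features, weights, bias))


if __name__ == "__main__":
    # Hypothetical 2-frame clip with 2-dimensional features.
    toy_features = [[1.0, 0.0], [0.0, 2.0]]
    print(video_score(toy_features, weights=[1.0, 1.0], bias=0.0))
```

Because only `weights` and `bias` would be trained in such a setup, differences in detection performance can be attributed to the frozen backbone features rather than to the classifier, which is the point the reviewer makes about isolating representational power.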
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
