HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training

Xuecheng Wu; Danlei Huang; Heli Sun; Xinyi Yin; Yifan Wang; Hao Wang; Jia Zhang; Fei Wang; Peihao Guo; Suyu Xing; Junxiao Xue; Liang He

arXiv:2507.22781·cs.CV·July 31, 2025

TL;DR

HOLA is an audio-visual deepfake detection framework that combines hierarchical contextual modeling with large-scale self-supervised pre-training. It ranks 1st in the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge.

Contribution

The paper introduces a unified two-stage framework (large-scale audio-visual self-supervised pre-training followed by detection) that combines hierarchical contextual modeling with a pseudo-supervised signal injection strategy to improve video-level deepfake detection.

Findings

01

Ranks 1st in the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge

02

Outperforms previous methods by 0.0476 AUC

03

Demonstrates effectiveness through extensive experiments and ablation studies

Abstract

Advances in generative AI have made video-level deepfake detection increasingly challenging, exposing the limitations of current detection techniques. In this paper, we present HOLA, our solution to the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training for multimodal video-level deepfake detection, leveraging our self-built dataset of 1.81M samples, which leads to a unified two-stage framework. Specifically, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations from a local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancements. Moreover, we propose the pseudo…
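To make the phrase "gated aggregations under the local-global perspective" more concrete, the following is a minimal sketch of one way local and global temporal context could be blended with a sigmoid gate. It is not the authors' implementation: the function names, the window-mean local context, the clip-mean global context, and the scalar gate with `gate_weights` and `gate_bias` are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_context(frames, window=3):
    """Local view: mean of each frame's neighbours in a centred window."""
    half = window // 2
    out = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        span = frames[lo:hi]
        out.append([sum(col) / len(span) for col in zip(*span)])
    return out


def global_context(frames):
    """Global view: a single mean vector over the whole clip."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def gated_aggregate(frames, gate_weights, gate_bias=0.0, window=3):
    """Blend local and global context per frame with a scalar sigmoid gate.

    gate_weights (length 2 * feature_dim) and gate_bias are hypothetical
    stand-ins for parameters that would be learned in a real model.
    """
    loc = local_context(frames, window)
    glob = global_context(frames)
    fused = []
    for l in loc:
        # The gate looks at the concatenated local and global vectors.
        score = sum(w * x for w, x in zip(gate_weights, l + glob)) + gate_bias
        g = sigmoid(score)
        # g near 1 favours local context, g near 0 favours global context.
        fused.append([g * a + (1.0 - g) * b for a, b in zip(l, glob)])
    return fused


# Toy clip: 3 frames with 2-dimensional features.
clip = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
fused = gated_aggregate(clip, gate_weights=[0.0] * 4)
```

With all-zero gate weights and zero bias the gate is 0.5 for every frame, so each fused vector is the midpoint of its local and global context. A large positive bias pushes the output toward the local view.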
