Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
Xuanjun Chen, Shih-Peng Cheng, Jiawei Du, Lin Zhang, Xiaoxiao Miao, Chung-Che Wang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

TL;DR
This paper introduces HBMNet, a hierarchical network for localizing deepfake regions in audio-visual content, effectively handling partial manipulations by integrating multi-scale cues and boundary-content relationships.
Contribution
The paper presents a novel Hierarchical Boundary Modeling Network that combines multi-scale temporal cues and boundary-content relationships for improved deepfake localization.
Findings
HBMNet outperforms existing methods BA-TFD and UMMAFormer.
Frame-level supervision enhances recall in localization.
Multi-scale and bidirectional modeling improve precision and overall performance.
Abstract
Audio-visual temporal deepfake localization under the content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions are usually only spanning a few frames, with the majority of the rest remaining identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing
