Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling

Xuanjun Chen; Shih-Peng Cheng; Jiawei Du; Lin Zhang; Xiaoxiao Miao; Chung-Che Wang; Haibin Wu; Hung-yi Lee; Jyh-Shing Roger Jang

arXiv:2508.02000·cs.SD·August 5, 2025

Open Access

TL;DR

This paper introduces HBMNet, a hierarchical network for localizing deepfake regions in audio-visual content, effectively handling partial manipulations by integrating multi-scale cues and boundary-content relationships.

Contribution

The paper presents a novel Hierarchical Boundary Modeling Network that combines multi-scale temporal cues and boundary-content relationships for improved deepfake localization.

Findings

01. HBMNet outperforms existing methods BA-TFD and UMMAFormer.
02. Frame-level supervision enhances recall in localization.
03. Multi-scale and bidirectional modeling improve precision and overall performance.

Abstract

Audio-visual temporal deepfake localization under content-driven partial manipulation remains a highly challenging task. In this scenario, the deepfake regions usually span only a few frames, while the rest of the content remains identical to the original. To tackle this, we propose a Hierarchical Boundary Modeling Network (HBMNet), which includes three modules: an Audio-Visual Feature Encoder that extracts discriminative frame-level representations, a Coarse Proposal Generator that predicts candidate boundary regions, and a Fine-grained Probabilities Generator that refines these proposals using bidirectional boundary-content probabilities. From the modality perspective, we enhance audio-visual learning through dedicated encoding and fusion, reinforced by frame-level supervision to boost discriminability. From the temporal perspective, HBMNet integrates multi-scale cues…
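
To make the coarse-to-fine idea in the abstract more concrete, the following is a minimal toy sketch, not HBMNet's implementation. It assumes that per-frame fake probabilities and per-frame start/end boundary probabilities are already available as plain lists (in the paper these would come from learned modules). The function names `coarse_proposals` and `score_proposal`, the 0.5 threshold, and the product-based score are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive


def coarse_proposals(frame_probs: List[float], threshold: float = 0.5) -> List[Segment]:
    """Group consecutive frames whose fake probability exceeds the threshold
    into candidate segments (an assumed, simplified stand-in for a proposal stage)."""
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(frame_probs):
        if p > threshold and start is None:
            start = i  # a candidate fake region opens here
        elif p <= threshold and start is not None:
            segments.append((start, i))  # the region closes before frame i
            start = None
    if start is not None:
        segments.append((start, len(frame_probs)))  # region runs to the last frame
    return segments


def score_proposal(
    segment: Segment,
    frame_probs: List[float],
    start_probs: List[float],
    end_probs: List[float],
) -> float:
    """Combine boundary and content evidence into one confidence value.
    The product form is an illustrative assumption, not the paper's formula."""
    s, e = segment
    content = sum(frame_probs[s:e]) / (e - s)  # mean fake probability inside the segment
    return start_probs[s] * end_probs[e - 1] * content


# Hypothetical inputs for a 9-frame clip.
frame_probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.2]
start_probs = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.1, 0.5, 0.0]
end_probs = [0.0, 0.0, 0.1, 0.1, 0.8, 0.1, 0.0, 0.4, 0.1]

proposals = coarse_proposals(frame_probs)
ranked = sorted(
    proposals,
    key=lambda seg: score_proposal(seg, frame_probs, start_probs, end_probs),
    reverse=True,
)
print(ranked)
```

In this toy setting, a short segment with strong boundary evidence ranks above a segment whose frames only weakly exceed the threshold, which is the intuition behind refining coarse proposals with boundary and content probabilities.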

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing