Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network
Junyan Wu, Wenbo Xu, Wei Lu, Xiangyang Luo, Rui Yang, Shize Guo

TL;DR
This paper introduces a weakly-supervised audio forgery localization method that uses progressive co-learning and self-supervision to accurately identify forged regions without requiring fine-grained annotations.
Contribution
The paper proposes a novel progressive audio-language co-learning network (LOCO) that leverages semantic priors and self-supervised refinement for audio forgery localization under weak supervision.
Findings
Achieves state-of-the-art performance on three benchmarks.
Effectively utilizes semantic priors for forgery detection.
Improves localization accuracy with progressive pseudo-label refinement.
Abstract
Audio temporal forgery localization (ATFL) aims to find the precise forged regions of partially spoofed audio that has been purposefully modified. Existing ATFL methods rely on training efficient networks with fine-grained annotations, which are costly and challenging to obtain in real-world scenarios. To meet this challenge, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to promote localization performance under weak supervision. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed from utterance-level annotations together with learnable prompts, which can incorporate semantic priors into temporal content features…
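The weak-supervision idea in the abstract can be illustrated with a small sketch. This is not the LOCO implementation. It is a toy example that assumes frame-level audio features and two prompt embeddings ("real" and "fake") already live in a shared space. The function names, the temperature, the top-k pooling, and the threshold are all hypothetical choices made for illustration. The sketch scores each frame against the prompts, pools frame scores into an utterance-level score that could be trained against the utterance-level label, and derives frame-level pseudo-labels of the kind a refinement stage might use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def frame_fake_probs(frames, real_prompt, fake_prompt, temperature=0.1):
    """Per-frame probability of 'fake' from a two-way softmax over prompt similarities.

    Assumption: frames and prompts share an embedding space (hypothetical setup).
    """
    probs = []
    for frame in frames:
        s_real = cosine(frame, real_prompt) / temperature
        s_fake = cosine(frame, fake_prompt) / temperature
        # A two-class softmax reduces to a sigmoid of the score difference.
        probs.append(1.0 / (1.0 + math.exp(s_real - s_fake)))
    return probs


def utterance_score(probs, k=2):
    """Pool frame scores into one utterance score via the mean of the top-k frames.

    Assumption: top-k pooling is one common multiple-instance choice, not
    necessarily what the paper uses.
    """
    top = sorted(probs, reverse=True)[:k]
    return sum(top) / len(top)


def pseudo_labels(probs, utterance_is_fake, threshold=0.5):
    """Turn frame scores into frame-level pseudo-labels using the utterance label."""
    if not utterance_is_fake:
        # A fully real utterance contains no forged frames.
        return [0] * len(probs)
    labels = [1 if p >= threshold else 0 for p in probs]
    if not any(labels):
        # A fake utterance must contain at least one forged frame.
        labels[probs.index(max(probs))] = 1
    return labels
```

In this sketch only the utterance-level label is needed for supervision: the pooled score is compared with it during training, while the pseudo-labels give a frame-level signal that later rounds could refine.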
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Music and Audio Processing
Methods: Contrastive Learning
