Partially Fake Audio Detection by Self-attention-based Fake Span Discovery
Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng

TL;DR
This paper introduces a self-attention-based fake span discovery framework to detect partially fake audio clips, addressing emerging threats from advanced speech synthesis and voice conversion technologies.
Contribution
It proposes a novel fake span detection module that combines a question-answering strategy with self-attention to improve detection of partially fake audio, a new challenge in audio deep synthesis detection.
Findings
Ranked second in the partially fake audio detection track of ADD 2022.
Effective fake span localization within partially fake audios.
Enhanced generalization in detecting manipulated audio segments.
Abstract
The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be exploited by in-the-wild attackers for illegal purposes. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, and on replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios to more aspects. ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework that introduces the question-answering (fake span discovery) strategy with the self-attention mechanism to detect partially fake…
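The question-answering framing in the abstract can be illustrated with the span decoding step commonly used in extractive question answering. The sketch below is not the authors' implementation: it assumes a model has already produced one "start" score and one "end" score per audio frame, and the function name `best_span` and the example scores are hypothetical. It picks the frame span whose start and end scores sum highest, subject to the start not coming after the end.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float]) -> Tuple[int, int]:
    """Return (start, end) frame indices maximizing
    start_scores[start] + end_scores[end] subject to start <= end.

    Hypothetical helper: the per-frame scores are assumed to come from
    some upstream model, which is not shown here.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best = (0, 0)
    best_score = float("-inf")
    best_start = 0  # index of the highest start score seen so far

    # Single pass: for each candidate end frame, pair it with the best
    # start frame at or before it.
    for end in range(len(end_scores)):
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        score = start_scores[best_start] + end_scores[end]
        if score > best_score:
            best_score = score
            best = (best_start, end)
    return best


# Made-up scores for a five-frame clip, for illustration only.
start = [0.1, 2.0, 0.3, 0.0, -1.0]
end = [0.0, 0.1, 0.2, 3.0, 0.5]
span = best_span(start, end)
```

With these made-up scores, the selected span covers frames 1 through 3. The start-before-end constraint is what distinguishes span discovery from independently thresholding each frame.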
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Digital Media Forensic Detection
