Detection and Evaluation of human and machine generated speech in spoofing attacks on automatic speaker verification systems
Yang Gao, Jiachen Lian, Bhiksha Raj, Rita Singh

TL;DR
This paper evaluates the vulnerability of automatic speaker verification systems to both human and machine-generated speech spoofing attacks, proposing features to improve detection of deepfaked audio.
Contribution
The paper compares human impersonation and machine-generated speech attacks on ASV systems and introduces speech-production features for more robust deepfake detection.
Findings
Features like fundamental frequency entropy help detect deepfakes.
Machine-generated speech is more challenging to detect than human impersonation.
Proposed features improve detection robustness against unknown deepfake methods.
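To make the "fundamental frequency entropy" finding concrete, here is a minimal sketch of one way such a feature could be computed: the Shannon entropy of a histogram of voiced F0 values. This is an illustration under stated assumptions, not the authors' feature definition. The function name `f0_entropy`, the bin count, and the 50 to 500 Hz range are hypothetical choices, and the F0 track is assumed to come from any pitch tracker that reports 0 Hz for unvoiced frames.

```python
import math
from collections import Counter


def f0_entropy(f0_hz, n_bins=20, fmin=50.0, fmax=500.0):
    """Shannon entropy (in bits) of a histogram of voiced F0 values.

    Illustrative only: the binning and the frequency range are assumptions,
    not parameters taken from the paper.
    """
    # Keep frames inside the assumed pitch range; this also drops
    # unvoiced frames, which are conventionally reported as 0 Hz.
    voiced = [f for f in f0_hz if fmin <= f <= fmax]
    if not voiced:
        return 0.0

    width = (fmax - fmin) / n_bins
    # Map each value to a bin index, clamping fmax into the last bin.
    counts = Counter(min(int((f - fmin) / width), n_bins - 1) for f in voiced)

    total = len(voiced)
    probs = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probs)


# Toy F0 tracks (synthetic values, not data from the paper).
flat_track = [120.0] * 50                            # perfectly monotone pitch
varied_track = [100.0 + 5.0 * i for i in range(50)]  # steadily rising pitch

flat_score = f0_entropy(flat_track)
varied_score = f0_entropy(varied_track)
```

A monotone track falls into a single bin and scores zero, while a track that spreads across many bins scores higher. In a countermeasure system, a value like this would be one input among several to a classifier rather than a detector on its own.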
Abstract
Automatic speaker verification (ASV) systems utilize the biometric information in human speech to verify the speaker's identity. The techniques used for performing speaker verification are often vulnerable to malicious attacks that attempt to induce the ASV system to return wrong results, allowing an impostor to bypass the system and gain access. Attackers use a multitude of spoofing techniques for this, such as voice conversion, audio replay, speech synthesis, etc. In recent years, easily available tools to generate deepfaked audio have increased the potential threat to ASV systems. In this paper, we compare the potential of human impersonation (voice disguise) based attacks with attacks based on machine-generated speech, on black-box and white-box ASV systems. We also study countermeasures by using features that capture the unique aspects of human speech production, under the…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
