Securing Voice-driven Interfaces against Fake (Cloned) Audio Attacks
Hafiz Malik

TL;DR
This paper presents a method using higher-order spectral analysis to detect fake cloned speech, addressing security threats posed by advanced voice synthesis technologies.
Contribution
It introduces a novel detection approach leveraging spectral artifacts to distinguish genuine speech from cloned audio, improving security in voice interfaces.
Findings
Near-perfect detection rate of cloned speech
Effective on multiple cloning approaches
Robust against different synthesis artifacts
Abstract
Voice cloning technologies have found applications in a variety of areas, ranging from personalized speech interfaces to advertising and robotics. Existing voice cloning systems can learn speaker characteristics and use trained models to synthesize a person's voice from only a few audio samples. Recent advances make it possible to generate cloned speech that is perceptually indistinguishable from bona fide speech. These advances pose new security and privacy threats to voice-driven interfaces and speech-based access control systems. State-of-the-art speech synthesis technologies use trained or tuned generative models for cloned speech generation. Trained generative models rely on linear operations, learned weights, and an excitation source for cloned speech synthesis. These systems leave characteristic artifacts in the synthesized speech.…
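The summary above names higher-order spectral analysis but does not give the paper's exact features or classifier. As an illustration only, the sketch below estimates the bispectrum, the most common third-order spectrum, using the direct FFT method averaged over frames. The function name `bispectrum` and the parameters `nfft` and `hop` are hypothetical choices made for this example and are not taken from the paper.

```python
import numpy as np


def bispectrum(x, nfft=64, hop=32):
    """Direct (FFT-based) bispectrum estimate, averaged over windowed frames.

    Illustrative sketch only; not the paper's implementation.
    Returns a complex array B of shape (nfft // 2, nfft // 2) with
    B[k, l] ~ E[X(k) * X(l) * conj(X(k + l))].
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(nfft)
    half = nfft // 2
    k = np.arange(half)
    # idx_sum[k, l] = (k + l) mod nfft selects the sum-frequency bin
    idx_sum = (k[:, None] + k[None, :]) % nfft

    acc = np.zeros((half, half), dtype=complex)
    n_frames = 0
    for start in range(0, len(x) - nfft + 1, hop):
        frame = x[start:start + nfft]
        # Remove the frame mean, then taper to limit spectral leakage
        frame = (frame - frame.mean()) * window
        X = np.fft.fft(frame)
        # Triple product for every frequency pair (k, l)
        acc += X[k][:, None] * X[k][None, :] * np.conj(X[idx_sum])
        n_frames += 1

    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    return acc / n_frames
```

Calling `bispectrum(signal)` on a one-dimensional array of audio samples returns a complex matrix whose magnitude is large at frequency pairs whose phases are coupled. A detector could, under these assumptions, compare statistics of the bispectral magnitude and phase between bona fide and synthesized recordings; which statistics the paper actually uses is not stated in this summary.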
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
