Scores Know Bob's Voice: Speaker Impersonation Attack

Chanwoo Hwang; Sunpill Kim; Yong Kiam Tan; Tianchi Liu; Seunghun Paik; Dongsoo Kim; Mondal Soumik; Khin Mi Mi Aung; Jae Hong Seo

arXiv:2603.02781·cs.CR·March 4, 2026

TL;DR

This paper introduces a feature-aligned inversion attack framework for speaker recognition systems. By aligning the generative model's latent space with speaker features, it reduces the number of queries a score-based impersonation attack needs.

Contribution

The paper proposes an inversion-based attack method that aligns the synthesis model's latent space with the discriminative feature space of speaker recognition systems, making score-based impersonation attacks more query-efficient and more effective.

Findings

01. Achieves up to 91.65% attack success with only 50 queries.

02. On average, requires 10x fewer queries than previous methods.

03. Enables a new subspace-projection attack paradigm.

Abstract

Advances in deep learning have enabled the widespread deployment of speaker recognition systems (SRSs), yet they remain vulnerable to score-based impersonation attacks. Existing attacks that operate directly on raw waveforms require a large number of queries due to the difficulty of optimizing in high-dimensional audio spaces. Latent-space optimization within generative models offers improved efficiency, but these latent spaces are shaped by data distribution matching and do not inherently capture speaker-discriminative geometry. As a result, optimization trajectories often fail to align with the adversarial direction needed to maximize victim scores. To address this limitation, we propose an inversion-based generative attack framework that explicitly aligns the latent space of the synthesis model with the discriminative feature space of SRSs. We first analyze the requirements of an…
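The query-efficiency argument in the abstract can be illustrated with a toy sketch. This is not the paper's method: it assumes a hypothetical black-box victim that returns a cosine-similarity score, uses a random linear map (`basis`) as a stand-in for a generative model's decoder, and runs plain random search under a 50-query budget. All names (`make_victim`, `random_search`, `basis`) are illustrative assumptions. It only shows the general setup of a score-based attack and why searching a low-dimensional latent space can be cheaper than searching the raw input space.

```python
import numpy as np

def make_victim(target, budget):
    """Hypothetical black-box scorer: cosine similarity to a hidden target vector."""
    state = {"queries": 0}

    def score(x):
        if state["queries"] >= budget:
            raise RuntimeError("query budget exhausted")
        state["queries"] += 1
        denom = np.linalg.norm(x) * np.linalg.norm(target) + 1e-12
        return float(x @ target / denom)

    return score, state

def random_search(score, decode, dim, n_queries, step=0.5, seed=0):
    """Greedy random search: keep a perturbation only if the victim score improves."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dim)
    best = score(decode(z))
    history = [best]
    for _ in range(n_queries - 1):
        cand = z + step * rng.standard_normal(dim)
        s = score(decode(cand))
        if s > best:
            z, best = cand, s
        history.append(best)
    return z, history

rng = np.random.default_rng(42)
feat_dim, latent_dim, budget = 256, 8, 50

# Toy linear "decoder" (assumption); the target lies in its range.
basis = rng.standard_normal((feat_dim, latent_dim))
target = basis @ rng.standard_normal(latent_dim)

# Search in the 8-dimensional latent space.
score_lat, state_lat = make_victim(target, budget)
_, hist_latent = random_search(score_lat, lambda z: basis @ z, latent_dim, budget)

# Search directly in the 256-dimensional input space.
score_raw, state_raw = make_victim(target, budget)
_, hist_raw = random_search(score_raw, lambda x: x, feat_dim, budget)

print(f"latent-space search, best score after {budget} queries: {hist_latent[-1]:.3f}")
print(f"raw-space search, best score after {budget} queries: {hist_raw[-1]:.3f}")
```

In this toy the latent space is well aligned with the score by construction, because the target lies in the decoder's range. The abstract's point is that real generative latent spaces are not aligned this way by default, which is the gap the proposed framework addresses.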

Taxonomy

Topics: Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning · Emotion and Mood Recognition