EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models

Wenhan Yao; Zedong Xing; Xiarun Chen; Jia Liu; Yongqiang He; Weiping Wen

arXiv:2408.15508·cs.SD·September 9, 2024

TL;DR

This paper introduces EmoAttack, a novel speech backdoor attack method leveraging emotional voice conversion to exploit emotional attributes in speech, demonstrating high success rates on speech classification models.

Contribution

The paper proposes EmoAttack, the first method to use emotional voice conversion as a trigger for speech backdoor attacks, and highlights the role of emotion in attack effectiveness.

Findings

01

EmoAttack achieves high attack success rates.

02

Speech with intense emotion is more vulnerable.

03

Effective on keyword spotting and speaker verification tasks.
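The findings above are stated in terms of attack success rate. As a minimal sketch, assuming the usual backdoor-literature definition (the fraction of triggered inputs, whose true label is not already the target, that the model classifies as the attacker's target), the metric could be computed as below. The function names and the definition itself are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(
    preds_on_triggered: Sequence[int],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of triggered samples classified as the target label.

    Assumed definition: samples whose true label already equals the
    target are excluded, since they cannot show a successful attack.
    """
    eligible = [
        pred
        for pred, label in zip(preds_on_triggered, true_labels)
        if label != target_label
    ]
    if not eligible:
        raise ValueError("no samples with a non-target true label")
    return sum(pred == target_label for pred in eligible) / len(eligible)


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Plain classification accuracy, e.g. on clean (untriggered) inputs."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

Comparing `accuracy` on clean inputs for a backdoored model and a clean model is one way to check that the attack does not visibly degrade normal behaviour; how the paper defines its accuracy-related metric is not stated in this summary.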

Abstract

Deep speech classification tasks, chiefly keyword spotting and speaker verification, play a crucial role in speech-based human-computer interaction. Recent work has shown that these technologies are vulnerable to backdoor attacks. Existing triggers attack speech samples through noise disruption or component modification. We suggest that speech backdoor attacks can instead strategically target emotion, a higher-level subjective perceptual attribute inherent in speech. We further propose that emotional voice conversion technology can serve as the speech backdoor trigger, and call this method EmoAttack. On this basis, we conducted attack experiments on two speech classification tasks, showing that EmoAttack is an effective trigger with a remarkable attack success rate and accuracy variance.…
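To make the idea of a backdoor trigger concrete, here is a minimal sketch of generic training-set poisoning: a fraction of samples is passed through a trigger transform and relabelled to an attacker-chosen target. This is the standard backdoor setup, assumed here for illustration; the abstract does not describe the paper's actual pipeline. `poison_dataset` and `placeholder_trigger` are hypothetical names, and the placeholder simply scales amplitude. In EmoAttack the trigger would be an emotional voice conversion model, which is not reproduced here.

```python
import random
from typing import Callable, List, Set, Tuple

Waveform = List[float]
Sample = Tuple[Waveform, int]  # (waveform, class label)


def poison_dataset(
    samples: List[Sample],
    trigger: Callable[[Waveform], Waveform],
    target_label: int,
    poison_rate: float,
    seed: int = 0,
) -> Tuple[List[Sample], Set[int]]:
    """Apply `trigger` to a random subset and relabel it to `target_label`.

    Returns the poisoned dataset and the indices that were modified.
    """
    if not 0.0 <= poison_rate <= 1.0:
        raise ValueError("poison_rate must be in [0, 1]")
    rng = random.Random(seed)
    n_poison = round(len(samples) * poison_rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))

    poisoned: List[Sample] = []
    for i, (wave, label) in enumerate(samples):
        if i in chosen:
            # Triggered sample: transformed audio, attacker's label.
            poisoned.append((trigger(wave), target_label))
        else:
            # Clean sample: left untouched.
            poisoned.append((wave, label))
    return poisoned, chosen


def placeholder_trigger(wave: Waveform) -> Waveform:
    """Stand-in transform (halves amplitude); NOT emotional voice conversion."""
    return [0.5 * x for x in wave]
```

A model trained on the returned dataset would, if the attack works, behave normally on clean speech while mapping triggered speech to `target_label`.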

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus