Audio Adversarial Examples: Attacks Using Vocal Masks

Kai Yuan Tay; Lynnette Ng; Wei Han Chua; Lucerne Loke; Danqi Ye; Melissa Chua

arXiv:2102.02417·cs.SD·February 9, 2021

Open Access

TL;DR

This paper introduces an audio adversarial attack that overlays a vocal mask, generated from the original audio, onto that audio. The resulting examples fool state-of-the-art speech-to-text (STT) systems while the speech remains recognizable to human listeners, highlighting vulnerabilities in current ASR models.

Contribution

The authors propose a vocal mask-based adversarial attack that deceives five leading speech-to-text systems (DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx) without hindering human comprehension.

Findings

01. Adversarial examples fool SOTA STT systems

02. Humans can still accurately transcribe masked audio

03. The attack reveals vulnerabilities in current ASR models

Abstract

We construct audio adversarial examples on automatic Speech-To-Text systems. Given any audio waveform, we produce another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. The feasibility of this attack introduces a new domain to study machine and human perception of speech.
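The overlay step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the vocal mask is derived, so `placeholder_mask` below is a purely hypothetical stand-in (a delayed copy of the input), and the names `overlay_mask` and `mask_gain` are assumptions introduced for illustration. Waveforms are assumed to be sequences of float samples in [-1.0, 1.0].

```python
def overlay_mask(original, mask, mask_gain=0.5):
    """Mix a mask waveform into the original, sample by sample.

    The mask is treated as zero beyond its own length and truncated to the
    length of the original; each mixed sample is clipped to [-1.0, 1.0].
    """
    mixed = []
    for i, sample in enumerate(original):
        m = mask[i] if i < len(mask) else 0.0
        value = sample + mask_gain * m
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed


def placeholder_mask(original, delay=2):
    """Hypothetical stand-in for mask generation (NOT the paper's method):
    the input shifted later in time by `delay` samples."""
    kept = max(0, len(original) - delay)
    return [0.0] * delay + list(original[:kept])


# Toy waveform; real audio would hold thousands of samples per second.
original = [0.0, 0.5, -0.5, 0.25, -0.25]
mask = placeholder_mask(original, delay=2)
adversarial = overlay_mask(original, mask, mask_gain=0.5)
```

In an evaluation like the one the abstract describes, `adversarial` would be passed both to an STT system and to human annotators, and the two transcriptions compared against the original utterance.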

Taxonomy

Topics: Digital Media Forensic Detection · Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning