Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs
Zijian Ling, Pingyi Hu, Xiuyong Gao, Xiaojing Ma, Man Zhou, Jun Feng, Songfeng Lu, Dongmei Zhang, Bin Benjamin Zhu

TL;DR
The paper introduces SWhisper, a practical covert acoustic channel that enables inaudible prompt injections into speech-driven LLMs, demonstrating high effectiveness and imperceptibility in real-world scenarios.
Contribution
SWhisper is the first framework to achieve robust, inaudible prompt-based attacks on speech-driven LLMs using commodity hardware under black-box conditions.
Findings
Achieves a non-refusal rate of up to 0.94 on commercial models
Demonstrates high transferability of jailbreak prompts
Injected prompts are perceptually indistinguishable from background sounds
Abstract
Speech-driven large language models (LLMs) are increasingly accessed through speech interfaces, introducing new security risks via open acoustic channels. We present Sirens' Whisper (SWhisper), the first practical framework for covert prompt-based attacks against speech-driven LLMs under realistic black-box conditions using commodity hardware. SWhisper enables robust, inaudible delivery of arbitrary target baseband audio (including long and structured prompts) on commodity devices by encoding it into near-ultrasound waveforms that demodulate faithfully after acoustic transmission and microphone nonlinearity. This is achieved through a simple yet effective approach to modeling nonlinear channel characteristics across devices and environments, combined with lightweight channel-inversion pre-compensation. Building on this high-fidelity covert channel, we design a voice-aware jailbreak…
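The abstract's statement that a near-ultrasound waveform "demodulates" through microphone nonlinearity can be illustrated with a textbook square-law model. The sketch below is not the paper's method: it omits the channel modeling and pre-compensation the abstract mentions, and every name and value in it (the 96 kHz sample rate, the 20 kHz carrier, the coefficients `a1` and `a2`, the 8 kHz cutoff, and all function names) is an assumption chosen for illustration. A 1 kHz tone stands in for baseband speech.

```python
import numpy as np

FS = 96_000          # sample rate in Hz (assumed)
CARRIER_HZ = 20_000  # hypothetical near-ultrasonic carrier
TONE_HZ = 1_000      # test tone standing in for baseband speech

def modulate(baseband, fs=FS, fc=CARRIER_HZ, depth=0.8):
    """Amplitude-modulate a baseband signal in [-1, 1] onto a carrier."""
    t = np.arange(len(baseband)) / fs
    return (1.0 + depth * baseband) * np.cos(2 * np.pi * fc * t)

def square_law_mic(x, a1=1.0, a2=0.5):
    """Toy microphone model: linear term plus a quadratic distortion term."""
    return a1 * x + a2 * x**2

def lowpass_fft(x, fs=FS, cutoff=8_000):
    """Ideal low-pass filter: zero every FFT bin above the cutoff."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[freqs > cutoff] = 0.0
    return np.fft.irfft(spec, n=len(x))

def dominant_freq(x, fs=FS):
    """Frequency of the strongest non-DC spectral component."""
    spec = np.abs(np.fft.rfft(x - np.mean(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return float(freqs[np.argmax(spec)])

t = np.arange(FS) / FS                        # one second of samples
baseband = np.sin(2 * np.pi * TONE_HZ * t)    # 1 kHz test tone
tx = modulate(baseband)                       # energy sits at 19-21 kHz only
rx = lowpass_fft(square_law_mic(tx))          # what a recorder would keep

print(dominant_freq(tx), dominant_freq(rx))
```

In this toy model the transmitted signal has no energy below 19 kHz, yet the quadratic term reproduces the 1 kHz tone in the audible band, along with a weaker 2 kHz harmonic. That kind of residual distortion is one reason a real channel would need compensation of the sort the abstract describes.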
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Speech Recognition and Synthesis · Speech and Audio Processing
