AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
Mintong Kang, Chejian Xu, Bo Li

TL;DR
This paper introduces AdvWave, a novel framework for stealthy adversarial attacks on large audio-language models, overcoming technical challenges like gradient shattering and behavioral variability to effectively jailbreak these models.
Contribution
AdvWave is the first comprehensive jailbreak framework for LALMs. It combines dual-phase optimization, adaptive target search, and classifier-guided generation of natural-sounding adversarial audio.
Findings
Achieves a 40% higher jailbreak attack success rate than baseline methods.
Effectively overcomes gradient shattering in LALMs.
Generates perceptually natural adversarial audio.
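To make the gradient-shattering finding concrete, here is a minimal toy sketch of why a discretization step can block gradient-based attacks. It is not the AdvWave method and not a real audio encoder: the codebook, the `quantize` function, the input, and the target are all hypothetical values chosen for illustration. The sketch estimates the gradient of a loss with respect to the input by finite differences, once with and once without a nearest-codebook quantizer in the path.

```python
# Toy illustration only: a hypothetical 1-D codebook stands in for the
# discretization step of an audio encoder. Not the paper's architecture.
CODEBOOK = [-1.0, 0.0, 1.0]

def quantize(x):
    """Map each sample to its nearest codebook entry (piecewise constant)."""
    return [min(CODEBOOK, key=lambda c: abs(v - c)) for v in x]

def loss(x, target, discretize):
    """Squared error between the (optionally quantized) features and a target."""
    feats = quantize(x) if discretize else x
    return sum((f - t) ** 2 for f, t in zip(feats, target))

def finite_diff_grad(x, target, discretize, eps=1e-4):
    """Central finite-difference estimate of d(loss)/dx for each input sample."""
    grad = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += eps
        minus[i] -= eps
        diff = loss(plus, target, discretize) - loss(minus, target, discretize)
        grad.append(diff / (2 * eps))
    return grad

x = [0.2, -0.7, 0.9]        # hypothetical input samples
target = [1.0, 1.0, -1.0]   # hypothetical target features

grad_continuous = finite_diff_grad(x, target, discretize=False)
grad_discrete = finite_diff_grad(x, target, discretize=True)

print("without quantizer:", grad_continuous)
print("with quantizer:   ", grad_discrete)
```

Without the quantizer, the estimated gradient is nonzero and points toward the target. With the quantizer, a small change to the input does not change the selected codebook entry, so the estimated gradient is zero for every sample and a plain gradient step has nothing to follow.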
Abstract
Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover,…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning
