AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Mintong Kang; Chejian Xu; Bo Li

arXiv:2412.08608 · cs.SD · December 12, 2024

TL;DR

This paper introduces AdvWave, a framework for stealthy adversarial jailbreak attacks on large audio-language models (LALMs) that overcomes technical challenges such as gradient shattering and behavioral variability to jailbreak these models effectively.

Contribution

AdvWave is the first comprehensive jailbreak framework for large audio-language models (LALMs). It combines dual-phase optimization, adaptive target search, and classifier-guided generation of natural-sounding adversarial audio.

Findings

1. Achieves a 40% higher jailbreak success rate than baseline methods.
2. Effectively overcomes gradient shattering in LALMs.
3. Generates perceptually natural adversarial audio.

Abstract

Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover,…
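The abstract attributes gradient shattering to the discretization step inside LALM audio encoders. The toy sketch below illustrates only that general phenomenon; it is not AdvWave's method, which the truncated abstract does not describe in detail. It assumes a hypothetical 1-D "encoder" that snaps a value to the nearest entry of a small codebook (`CODEBOOK`, `quantize`, and all numbers are illustrative). Because the quantizer is piecewise constant, the true gradient of a downstream loss is zero almost everywhere, so plain gradient descent cannot move. A straight-through estimator (STE), a common generic workaround that treats the quantizer as the identity in the backward pass, supplies a usable surrogate gradient.

```python
# Toy illustration of gradient shattering through a discretization step.
# Hypothetical setup, NOT AdvWave's algorithm: a 1-D "encoder" snaps a
# continuous value to the nearest codebook entry, then a squared error is taken.

CODEBOOK = [-1.0, 0.0, 1.0]  # hypothetical codebook values


def quantize(x: float) -> float:
    """Return the codebook entry nearest to x (a piecewise-constant map)."""
    return min(CODEBOOK, key=lambda c: abs(c - x))


def loss(x: float, target: float) -> float:
    """Squared error between the quantized input and a target value."""
    return (quantize(x) - target) ** 2


def finite_diff_grad(x: float, target: float, eps: float = 1e-4) -> float:
    """Central-difference estimate of the true gradient d loss / d x."""
    return (loss(x + eps, target) - loss(x - eps, target)) / (2 * eps)


def ste_grad(x: float, target: float) -> float:
    """Straight-through surrogate: pretend d quantize / d x == 1."""
    return 2 * (quantize(x) - target) * 1.0


def descend(x: float, target: float, grad_fn, lr: float = 0.3, steps: int = 20) -> float:
    """Plain gradient descent on x using the supplied gradient function."""
    for _ in range(steps):
        x -= lr * grad_fn(x, target)
    return x


if __name__ == "__main__":
    x0, target = 0.2, 1.0
    print(finite_diff_grad(x0, target))                      # 0.0: no signal
    print(ste_grad(x0, target))                              # -2.0
    print(quantize(descend(x0, target, finite_diff_grad)))   # 0.0: stuck
    print(quantize(descend(x0, target, ste_grad)))           # 1.0: reaches target
```

The STE gradient is biased, because it ignores the true (zero) derivative of the quantizer, and with large step sizes it can overshoot between codebook cells. This is one reason attacks on discretized encoders often need more specialized optimization than the plain gradient-based methods used against conventional DNN audio models.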

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning