When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Hiskias Dingeto; Taeyoun Kwon; Dasol Choi; Bodam Kim; DongGeon Lee; Haon Park; JaeHoon Lee; Jongho Shin

arXiv:2508.03365 · cs.SD · February 5, 2026

TL;DR

This paper presents WhisperInject, a novel two-stage adversarial attack framework that subtly manipulates audio inputs to jailbreak audio-language models, revealing practical vulnerabilities in multimodal AI systems.

Contribution

The paper introduces WhisperInject, combining reward-based optimization and gradient-based payload injection to effectively attack state-of-the-art audio-language models.

Findings

01 Achieves a 60-78% success rate across multiple benchmarks and models.

02 Demonstrates practical, covert audio adversarial attacks.

03 Reveals vulnerabilities in multimodal AI systems.

Abstract

As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio language models to generate harmful content. Our method embeds harmful payloads as subtle perturbations into audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use gradient-based optimization to embed…
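
The abstract names projected gradient descent (PGD) as the optimization primitive. The following is only a generic, minimal sketch of the projection idea, not the paper's RL-PGD method: it involves no audio and no model, and it uses a toy quadratic loss. The names `pgd_linf` and `grad_sq_dist`, the L-infinity bound `eps`, and the signed-gradient step are all illustrative assumptions. The point it illustrates is that each update is followed by clipping, so the perturbation never exceeds the chosen budget.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x, grad_fn, eps, step, iters):
    """Toy PGD: minimize a loss over a perturbation delta with |delta_i| <= eps.

    grad_fn(x_adv) returns the gradient of the loss at the perturbed point.
    Hypothetical helper for illustration; not taken from the paper.
    """
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = grad_fn(x_adv)
        # Signed gradient descent step on the perturbation.
        delta = [di - step * sign(gi) for di, gi in zip(delta, grad)]
        # Projection: clip back into the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


# Toy loss (assumption): squared distance between x + delta and a fixed target.
target = [1.0, 0.5, -0.6]


def grad_sq_dist(x_adv):
    """Gradient of sum((x_adv - target)^2) with respect to x_adv."""
    return [2.0 * (a - t) for a, t in zip(x_adv, target)]


x = [0.0, 0.5, -0.5]
eps = 0.05
delta = pgd_linf(x, grad_sq_dist, eps=eps, step=0.01, iters=20)
print(delta)
```

In this toy setting the first and third coordinates move toward the target until they reach the `eps` bound, while the second coordinate already matches the target and stays unperturbed.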
