ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

Weifei Jin; Yuxin Cao; Junjie Su; Minhui Xue; Jie Hao; Ke Xu; Jin Song Dong; Derui Wang

arXiv:2510.26096·cs.SD·October 31, 2025

TL;DR

ALMGuard is a defense framework that identifies universal safety shortcuts in Audio-Language Models (ALMs) and activates them at inference time, reducing jailbreak attack success rates while preserving model utility.

Contribution

This paper introduces ALMGuard, the first defense framework tailored to ALMs. It uses Shortcut Activation Perturbations (SAPs) and a Mel-Gradient Sparse Mask to improve robustness against ALM-specific jailbreak attacks.

Findings

01 Reduces jailbreak success rate to 4.6% across four models

02 Maintains comparable utility on benign benchmarks

03 Demonstrates robustness against seen and unseen attacks

Abstract

Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the…

Code & Models

Datasets

WeifeiJin/AdvBench-Audio
dataset · 270 dl
