Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Jaechul Roh, Amir Houmansadr

TL;DR
This paper systematically studies how benign fine-tuning affects safety in Audio LLMs, revealing architecture-dependent vulnerabilities and proposing mitigation strategies to prevent safety degradation.
Contribution
It presents the first systematic study of benign fine-tuning safety in Audio LLMs, decomposing proximity to harmful content into semantic, acoustic, and mixed axes and proposing effective defenses.
Findings
Benign fine-tuning can raise the jailbreak success rate to as high as 87.12%.
Which proximity axis does the most damage, and how severe the risk is, depends on the model architecture.
Data filtering and prompt-based defenses can nearly eliminate the safety risk.
Abstract
Prior work shows that fine-tuning aligned models on benign data degrades safety in text and vision modalities, and that proximity to harmful content in representation space predicts which samples cause the most damage. However, existing analyses operate within a single, undifferentiated embedding space -- leaving open whether distinct input properties drive the vulnerability differently. Audio introduces a structurally richer problem: a benign sample can neighbor harmful content not only through what is said but through how it sounds, even when its words are entirely innocuous. We present the first systematic study of benign fine-tuning safety in Audio LLMs, evaluating three state-of-the-art models with a proximity-based filtering framework that selects benign audio by embedding-space distance to harmful content. By decomposing proximity into semantic, acoustic, and mixed axes using…
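The proximity-based selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine metric, the max-over-harmful scoring rule, the function names (`cosine`, `proximity_scores`, `select_by_proximity`), and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the embeddings would come from a semantic or acoustic encoder, and the same routine could be run once per axis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def proximity_scores(benign, harmful):
    """Score each benign embedding by its highest similarity to any harmful embedding.

    Assumption: proximity is taken as the max cosine similarity; the paper may use
    a different distance or aggregation.
    """
    return [max(cosine(b, h) for h in harmful) for b in benign]

def select_by_proximity(benign, harmful, k, closest=True):
    """Return indices of the k benign samples closest to (or farthest from) harmful content."""
    scores = proximity_scores(benign, harmful)
    order = sorted(range(len(benign)), key=lambda i: scores[i], reverse=closest)
    return order[:k]

# Hypothetical toy embeddings standing in for encoder outputs.
harmful = [[1.0, 0.0, 0.0]]
benign = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]

closest = select_by_proximity(benign, harmful, k=1)                   # most at-risk sample
farthest = select_by_proximity(benign, harmful, k=2, closest=False)   # filtered training set
```

Selecting the closest samples mirrors the stress-test setting, where benign data near harmful content is used for fine-tuning; selecting the farthest samples mirrors a filtering-style defense that keeps training data away from harmful regions of the embedding space.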
