Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
Hao Yang, Lizhen Qu, Ehsan Shareghi, Gholamreza Haffari

TL;DR
This paper introduces an unsupervised safety-fine-tuning method for Large Audio Language Models that improves safety alignment with only a minimal increase in over-rejection, so safety gains do not come at the cost of helpfulness.
Contribution
It proposes a novel unsupervised safety-fine-tuning strategy that reshapes the model's representation space to balance safety and over-rejection in LALMs.
Findings
Significant safety improvements across three generations of Qwen LALMs.
Over-rejection rate increases by only 0.88% on average.
Effective safety enhancement under multiple input modalities.
Abstract
Large Audio Language Models (LALMs) have extended the capabilities of Large Language Models (LLMs) by enabling audio-based human interactions. However, recent research has revealed that LALMs remain vulnerable to harmful queries due to insufficient safety alignment. Despite advances in defence measures for text and vision LLMs, effective safety-alignment strategies and audio-safety datasets specifically targeting LALMs are notably absent. Meanwhile, defence measures based on Supervised Fine-tuning (SFT) struggle to improve safety while avoiding over-rejection, significantly compromising helpfulness. In this work, we propose an unsupervised safety-fine-tuning strategy as a remedy that reshapes the model's representation space to enhance the safety alignment of existing LALMs while balancing the risk of over-rejection. Our experiments, conducted across three generations of Qwen LALMs,…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
