Towards Reliable Large Audio Language Model

Ziyang Ma; Xiquan Li; Yakun Song; Wenxi Chen; Chenpeng Du; Jian Wu; Yuanzhe Chen; Zhuo Chen; Yuping Wang; Yuxuan Wang; Xie Chen

arXiv:2505.19294·cs.SD·May 27, 2025

Open Access

TL;DR

This paper explores methods to improve the reliability of large audio language models (LALMs), introducing a new evaluation metric and demonstrating that both training-free and training-based approaches can enhance model reliability across various audio modalities.

Contribution

It systematically investigates reliability enhancement techniques for LALMs, introduces the Reliability Gain Index (RGI), and analyzes the transferability of reliability awareness across different audio modalities.

Findings

01 Both training-free and training-based methods improve LALM reliability.

02 The proposed RGI metric effectively evaluates reliability improvements.

03 Reliability awareness can transfer across sound, music, and speech modalities.

Abstract

Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still cannot recognize their knowledge boundaries or proactively refuse to answer questions they cannot answer. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). In addition, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliability methods. Our findings…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing