AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Lionel Z. Wang, Shun Zhang, Xingjian Du, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Gelei Deng, Haoyang Li, Yiming Li, Xiaobin Zhuang, Tianlong Chen

TL;DR
AudioTrust introduces a comprehensive evaluation framework for assessing the trustworthiness of Audio Large Language Models (ALLMs) across multiple dimensions, targeting audio-specific vulnerabilities that stem from non-semantic acoustic cues such as timbre and background noise.
Contribution
This work presents the first large-scale, systematic benchmark specifically designed for evaluating ALLMs' trustworthiness in real-world audio scenarios, covering six key dimensions.
Findings
Identified significant trustworthiness risks from acoustic cues such as timbre and background noise.
Revealed limitations and failure modes of 14 state-of-the-art ALLMs under diverse high-risk audio scenarios.
Provided a publicly available benchmark and dataset for future research and development.
Abstract
The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness. However, existing evaluation frameworks are primarily designed for text and fail to capture vulnerabilities introduced by the acoustic properties of audio. We find that significant trustworthiness risks in ALLMs arise from non-semantic acoustic cues, such as timbre, accent, and background noise, which can be exploited to manipulate model behavior. To address this gap, we propose AudioTrust, the first large-scale and systematic framework for evaluating ALLM trustworthiness under audio-specific risks. AudioTrust covers six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. It includes 26 sub-tasks and a curated dataset of more than 4,420 audio samples collected from real-world scenarios, including daily…
Peer Reviews
Decision · ICLR 2026 Poster
- Audio-specific scope. The benchmark centers risks that are unique to audio, including bias from voice attributes, audio-grounded hallucinations, social-engineering safety failures, privacy leakage from speech, and spoofing in authentication. This makes the task design better aligned with acoustic realities than text-only frameworks.
- Clear breadth and transparency. The benchmark spans 18 experimental configurations and evaluates 14 SOTA models using a curated set exceeding 4,420 audi
- Judge dependence. Results depend on GPT-4o as the primary judge. Although humans verify the results, relying on a single model family as the scorer risks scorer bias. More ablations with alternative judges or dual-judge consensus would strengthen the claims.
- Metric calibration and comparability. Several dimensions rely on bespoke metrics, so the individual metrics and the aggregate score may have normalization issues.
Originality: The proposed framework addresses a significant gap, namely, the need for audio-specific trustworthiness evaluation of ALLMs. Furthermore, the proposed evaluation strategies (e.g., construction of test data designed to probe specific vulnerabilities) are unique and valuable to future work.
Quality: The paper is well-written with extensive supporting details.
Clarity: The appendix provides a clear explanation of how each dataset was constructed and the evaluation method used, includi
There is limited discussion of WHY different models exhibit different degrees of privacy, robustness, etc. Furthermore, little consideration is given to how these different privacy dimensions correlate with each other across the various models. The number of tasks in each trustworthiness domain is understandably limited to a few key examples. For example, stereotypes are assessed along the lines of math ability, doctor vs. nurse, etc. It would be helpful to understand why these specific examples were chos
This work is comprehensive and well-presented, proposing the largest and most comprehensive benchmark for systematically evaluating ALLM trustworthiness concerning audio-specific tasks. It also conducts extensive experiments to assess the performance of several advanced ALLMs on the proposed benchmark.
1. The reliance on GPT-4o as the primary evaluation model raises serious reproducibility and long-term comparability issues. If GPT-4o becomes inaccessible, future researchers may be unable to replicate or extend the reported results. Replacing or at least complementing GPT-4o with an open-source alternative would strengthen the benchmark’s sustainability and transparency.
2. The paper lacks sufficient detail about the data sources and ethical considerations underlying the benchmark. Since Audi
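Several reviewers suggest complementing the GPT-4o judge with a second judge or a dual-judge consensus. The sketch below illustrates one way such a check could look; it is not the paper's evaluation code. The function names `cohen_kappa` and `consensus`, the label set `"safe"`/`"unsafe"`, and the sample verdicts are all hypothetical and chosen only for illustration. It measures chance-corrected agreement between two judges with Cohen's kappa and keeps a verdict only where both judges agree.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judges' label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two judges give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each judge's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def consensus(labels_a, labels_b):
    """Keep a label where both judges agree; None marks items for human review."""
    return [a if a == b else None for a, b in zip(labels_a, labels_b)]


# Hypothetical verdicts from two judges on six model responses.
judge_a = ["safe", "unsafe", "safe", "unsafe", "safe", "safe"]
judge_b = ["safe", "unsafe", "unsafe", "unsafe", "safe", "safe"]

kappa = cohen_kappa(judge_a, judge_b)
merged = consensus(judge_a, judge_b)
print(f"kappa = {kappa:.3f}")
print(merged)
```

A low kappa would indicate that scores depend strongly on the choice of judge, which is the scorer-bias concern raised above. Items where `consensus` returns `None` are the natural candidates for human verification.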
Taxonomy
MethodsFocus · Sparse Evolutionary Training
