AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury; Sayan Nag; Subhrajyoti Dasgupta; Yaoting Wang; Mohamed Elhoseiny; Ruohan Gao; Dinesh Manocha

arXiv:2501.02135 · cs.CV · January 7, 2025

Open Access

TL;DR

AVTrustBench is a comprehensive benchmark of 600K samples spanning 9 tasks. It evaluates the reliability and robustness of audio-visual large language models (AVLLMs) along three dimensions: adversarial attack, compositional reasoning, and modality-specific dependency.

Contribution

The paper introduces AVTrustBench, the first extensive audio-visual (AV) multi-task benchmark for trustworthiness assessment. It also proposes CAVPref, a calibration training strategy that improves the robustness and reliability of AVLLMs.

Findings

01

Most existing AVLLMs fall well short of human-like comprehension.

02

CAVPref improves model robustness by up to 30.19%.

03

Benchmark and code will be publicly released.

Abstract

With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are primarily restricted to assessing the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal…

Taxonomy

Topics: Digital Rights Management and Security