MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

Yichi Zhang; Yao Huang; Yitong Sun; Chang Liu; Zhe Zhao; Zhengwei Fang; Yifan Wang; Huanran Chen; Xiao Yang; Xingxing Wei; Hang Su; Yinpeng Dong; Jun Zhu

arXiv:2406.07057 · cs.CL · December 9, 2024

TL;DR

This paper introduces MultiTrust, a comprehensive benchmark evaluating trustworthiness of multimodal large language models across five key aspects, revealing new risks and challenges specific to multimodal integration.

Contribution

The paper establishes the first unified benchmark for the trustworthiness of MLLMs, covering five aspects (truthfulness, safety, robustness, fairness, and privacy) and providing a scalable toolbox for future research.

Findings

01. MLLMs struggle with visual confusion and adversarial attacks.

02. Privacy, bias, and ideological risks are amplified in multimodal settings.

03. Proprietary models also exhibit trustworthiness vulnerabilities.

Abstract

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity…

Code & Models

Datasets

thu-ml/MultiTrust (dataset, 1.0k downloads)

Videos

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models (SlidesLive)

Taxonomy

Topics: Topic Modeling

Methods: Balanced Selection