SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Xianfu Cheng; Wei Zhang; Shiwei Zhang; Jian Yang; Xiangyuan Guan; Xianjie Wu; Xiang Li; Ge Zhang; Jiaheng Liu; Yuying Mai; Yutao Zeng; Zhoufutu Wen; Ke Jin; Baorui Wang; Weixiao Zhou; Yunhong Lu; Tongliang Li; Wenhao Huang; Zhoujun Li

arXiv:2502.13059 · cs.CL · February 19, 2025

TL;DR

SimpleVQA is a new benchmark designed to evaluate the factual accuracy of multimodal large language models in answering natural language questions, covering diverse tasks and ensuring high-quality, challenging queries for comprehensive assessment.

Contribution

This work introduces SimpleVQA, the first comprehensive multi-modal benchmark for evaluating the factuality of MLLMs, with a robust quality control and evaluation framework.

Findings

1. Leading MLLMs show varied performance on factuality tasks.

2. SimpleVQA reveals common error patterns in multimodal models.

3. The benchmark facilitates targeted improvements in MLLMs' factual reasoning.

Abstract

The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality of MLLMs when answering short natural-language questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are…

Code & Models

Datasets

m-a-p/SimpleVQA
dataset · 1.5k dl

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Natural Language Processing Techniques