"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models

Jihao Gu; Yingyao Wang; Pi Bu; Chen Wang; Ziming Wang; Tengtao Song; Donglai Wei; Jiale Yuan; Yingxiu Zhao; Yancheng He; Shilong Li; Jiaheng Liu; Meng Cao; Jun Song; Yingshui Tan; Xiang Li; Wenbo Su; Zhicheng Zheng; Xiaoyong Zhu; Bo Zheng

arXiv:2502.11718·cs.CL·June 2, 2025

TL;DR

This paper introduces ChineseSimpleVQA, a benchmark for evaluating the factual accuracy of large vision language models in Chinese, highlighting their knowledge capabilities and limitations through diverse, high-quality visual question-answering tasks.

Contribution

It presents the first Chinese factuality-based visual question-answering benchmark and a data construction pipeline that decouples visual recognition and knowledge discovery.

Findings

01. Identifies significant performance gaps in current LVLMs.

02. Provides a comprehensive evaluation of 34 models.

03. Highlights the importance of visual factuality assessment.

Abstract

The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and ease of evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and…
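
The decoupling described in the abstract can be illustrated with a small scoring sketch. This is not the authors' pipeline or grader: the `Record` fields, the category labels, and the exact-match comparison are all assumptions made for illustration, and the toy data is invented. The idea is only that, once each item has a recognition answer and a knowledge answer, errors can be attributed to one stage or the other.

```python
from collections import Counter
from dataclasses import dataclass
import unicodedata


@dataclass
class Record:
    """One graded item. Field names are hypothetical, not the dataset's schema."""
    object_gold: str   # reference answer to the recognition question
    object_pred: str   # model answer to the recognition question
    fact_gold: str     # reference answer to the final knowledge question
    fact_pred: str     # model answer to the final knowledge question


def normalize(text: str) -> str:
    """NFKC-fold, lowercase, and strip whitespace so short answers compare cleanly."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in folded if not ch.isspace())


def matches(pred: str, gold: str) -> bool:
    # Exact match after normalization: a crude stand-in for a real grader.
    return normalize(pred) == normalize(gold)


def error_breakdown(records):
    """Count items by which stage (recognition, knowledge) was answered correctly."""
    labels = {
        (True, True): "both_correct",
        (True, False): "knowledge_miss",
        (False, True): "recognition_miss",
        (False, False): "both_wrong",
    }
    counts = Counter()
    for r in records:
        seen = matches(r.object_pred, r.object_gold)
        known = matches(r.fact_pred, r.fact_gold)
        counts[labels[(seen, known)]] += 1
    return dict(counts)


# Toy placeholder data, invented purely for illustration.
toy = [
    Record("塔A", "塔 A", "１９９０", "1990"),  # both stages match after folding
    Record("桥B", "桥B", "2001", "2005"),       # recognized, but wrong fact
    Record("楼C", "楼D", "城市X", "城市Y"),     # wrong at both stages
]
print(error_breakdown(toy))
```

A high `knowledge_miss` count relative to `recognition_miss` would suggest that a model sees the object but lacks the associated fact. Short answers are what make this kind of simple comparison workable at all, though a real evaluation would likely need a more tolerant grader than exact match.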

Code & Models

Datasets

OpenStellarTeam/Chinese-SimpleVQA
dataset · 36 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
