CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs

Ai Jian; Weijie Qiu; Xiaokun Wang; Peiyu Wang; Yunzhuo Hao; Jiangbo Pei; Yichen Wei; Yi Peng; Xuchen Song

arXiv:2505.24120 · cs.CV · June 18, 2025

TL;DR

CSVQA is a new Chinese multimodal benchmark designed to evaluate scientific reasoning in vision-language models, emphasizing domain knowledge, visual evidence, and complex reasoning in STEM contexts.

Contribution

We introduce CSVQA, a challenging benchmark with domain-specific questions and a protocol for reasoning validation, highlighting current models' limitations in scientific reasoning.

Findings

01 Top model achieves only 49.6% accuracy

02 Models struggle with domain knowledge integration

03 Benchmark emphasizes real-world scientific content
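
To make the headline accuracy figure concrete, here is a minimal sketch of how accuracy is typically computed on a question-answering benchmark: the fraction of predictions that match the reference answers. This is an illustration only, not the authors' evaluation protocol. The function name `accuracy` and the toy answers are hypothetical, and exact-match scoring is an assumption.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers.

    Assumption: exact-match scoring. The paper's actual scoring rules
    may differ (for example, for open-ended questions).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data (hypothetical, not taken from CSVQA): 2 of 4 answers match.
toy_predictions = ["A", "C", "B", "D"]
toy_references = ["A", "B", "B", "A"]
toy_accuracy = accuracy(toy_predictions, toy_references)
print(f"{toy_accuracy:.1%}")
```

Under this definition, a reported score of 49.6% means that the model answered roughly half of the scored questions correctly.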

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate generic image comprehension or text-driven reasoning, lacking authentic scientific contexts that require domain-specific knowledge integration with visual evidence analysis. To fill this gap, we present CSVQA, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features 1,378 carefully constructed question-answer pairs spanning diverse STEM disciplines, each demanding domain knowledge, integration of visual evidence, and higher-order reasoning. Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex…

Code & Models

Datasets

Skywork/CSVQA
dataset · 361 dl
