TL;DR
SciVQR is a multimodal benchmark for evaluating scientific reasoning in multimodal large language models (MLLMs) across six disciplines, with an emphasis on complex, multi-step inference and traceable reasoning.
Contribution
Introduces SciVQR, a multidisciplinary multimodal benchmark in which 46% of tasks include expert-authored solutions, to better evaluate and understand scientific reasoning in multimodal large language models.
Findings
Leading MLLMs show significant limitations in complex reasoning tasks.
SciVQR reveals gaps in models' ability to handle interdisciplinary scientific visuals.
The benchmark encourages the development of models with stronger multi-step reasoning.
Abstract
Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models…
