GeoR-Bench: Evaluating Geoscience Visual Reasoning
Yushuo Zheng, Zicheng Zhang, Huiyu Duan, Chunyi Li, Zijian Chen, Ziheng Jia, Yue Shi, Ke Gu, Xiongkuo Min, Guangtao Zhai

TL;DR
GeoR-Bench is a benchmark for evaluating how well AI models reason over geoscience visual data, and it highlights current models' limited understanding of earth science processes.
Contribution
The paper presents GeoR-Bench, a new benchmark with diverse geoscience tasks and evaluation criteria, to assess and improve AI reasoning in geoscience applications.
Findings
Current models achieve low accuracy; the best reaches only 42.7% strict accuracy.
Model outputs often score higher on visual quality than on scientific reasoning accuracy.
Geoscience reasoning remains a significant challenge for AI models.
Abstract
Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types,…
