TL;DR
GeoLaux introduces a detailed benchmark dataset for evaluating multimodal large language models' ability to perform long-step geometric reasoning and auxiliary line construction, revealing significant performance gaps and guiding future improvements.
Contribution
The paper presents GeoLaux, a comprehensive dataset and evaluation framework specifically designed for assessing MLLMs' geometry reasoning, especially for long-step problems requiring auxiliary lines.
Findings
Models perform markedly worse on long-step problems than on short-step ones, with performance dropping by more than 50% in many cases.
Auxiliary line construction is critical to geometric reasoning, and it is an area where current models need improvement.
Providing limited hints improves process correctness, while explicit answers may hinder intermediate reasoning.
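As a rough illustration of the first finding, the sketch below computes a relative performance drop between short-step and long-step accuracy. The function name, the relative (rather than absolute) definition of "drop", and the model scores are all assumptions made for illustration; this summary does not state how the paper defines the drop, and the numbers are not results from GeoLaux.

```python
def relative_drop(short_acc: float, long_acc: float) -> float:
    """Relative accuracy drop from short-step to long-step problems.

    Assumed definition: (short - long) / short. The paper may define it differently.
    """
    if short_acc <= 0:
        raise ValueError("short_acc must be positive")
    return (short_acc - long_acc) / short_acc


# Hypothetical (short-step, long-step) accuracies, for illustration only.
scores = {
    "model_a": (0.60, 0.24),
    "model_b": (0.50, 0.40),
}

# Relative drop per model, then the models whose drop exceeds 50%.
drops = {name: relative_drop(s, l) for name, (s, l) in scores.items()}
large_drops = [name for name, d in drops.items() if d > 0.5]
```

With these made-up numbers, only the hypothetical `model_a` loses more than half of its short-step accuracy.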
Abstract
Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a…
