GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines

Yumeng Fu; Jiayin Zhu; Lingling Zhang; Wenjun Wu; Bo Zhao; Shaoxuan Ma; Yushun Zhang; Jun Liu

arXiv:2508.06226·cs.AI·May 15, 2026

TL;DR

GeoLaux is a fine-grained annotated benchmark of 2186 calculation and proof problems for evaluating multimodal large language models (MLLMs) on long-step geometric reasoning and auxiliary line construction. An evaluation of 23 leading MLLMs on it reveals significant performance gaps, particularly on long-step problems.

Contribution

The paper presents GeoLaux, a comprehensive dataset and evaluation framework specifically designed for assessing MLLMs' geometry reasoning, especially for long-step problems requiring auxiliary lines.

Findings

01

Models perform significantly worse on long-step problems than on short-step ones, with performance drops of over 50% in many cases.

02

Auxiliary line construction is critical for geometric reasoning, and models' ability to perform it needs improvement.

03

Providing limited hints improves process correctness, while explicit answers may hinder intermediate reasoning.
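The first finding compares accuracy on short-step and long-step problems. As a rough illustration of how such a drop could be measured, the sketch below splits per-problem results by solution length and reports the relative decrease in accuracy. The record fields (`steps`, `correct`), the `step_threshold` cutoff, and the sample data are all hypothetical; the page does not state how the paper defines "long-step" or how it computes the drop.

```python
from statistics import mean


def accuracy(records):
    """Fraction of records marked correct."""
    return mean(1.0 if r["correct"] else 0.0 for r in records)


def relative_drop(records, step_threshold):
    """Relative accuracy drop from short-step to long-step problems.

    A problem counts as long-step when its solution has more than
    `step_threshold` steps. The threshold is an assumption made for
    this sketch, not a value taken from the paper.
    """
    short = [r for r in records if r["steps"] <= step_threshold]
    long_ = [r for r in records if r["steps"] > step_threshold]
    if not short or not long_:
        raise ValueError("need at least one short-step and one long-step problem")
    acc_short = accuracy(short)
    if acc_short == 0:
        raise ValueError("short-step accuracy is zero; relative drop is undefined")
    return (acc_short - accuracy(long_)) / acc_short


# Made-up results for one model: 3 of 4 short-step problems solved,
# 1 of 4 long-step problems solved.
records = [
    {"steps": 2, "correct": True},
    {"steps": 3, "correct": True},
    {"steps": 4, "correct": True},
    {"steps": 4, "correct": False},
    {"steps": 7, "correct": True},
    {"steps": 9, "correct": False},
    {"steps": 12, "correct": False},
    {"steps": 24, "correct": False},
]

drop = relative_drop(records, step_threshold=4)
print(f"relative accuracy drop: {drop:.1%}")
```

With this made-up data, short-step accuracy is 0.75 and long-step accuracy is 0.25, so the relative drop is two thirds. A relative drop is only one possible reading of "performance drop"; an absolute difference in accuracy would be an equally plausible choice.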

Abstract

Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a…

Code & Models

Repositories

Candice-yu/GeoLaux (GitHub)
