GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

Jinwoong Kim; Rui Yang; Huishuai Zhang

arXiv:2605.13167 · cs.CL · May 14, 2026

TL;DR

GeoBuildBench is a new benchmark for testing large language models' ability to generate executable geometric constructions from natural language problems, emphasizing interactive reasoning and constraint satisfaction.

Contribution

It introduces a novel interactive geometry construction benchmark with 489 problems, enabling evaluation of models' grounded reasoning and self-correction capabilities.

Findings

01 Models often hallucinate structures and omit required objects in generated diagrams.

02 Models struggle to satisfy geometric constraints even when given visual cues and feedback.

03 GeoBuildBench provides a rigorous testbed for grounded, executable reasoning in geometry.

Abstract

We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagramming as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program that produces a diagram satisfying explicitly specified geometric objects and verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates,…
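To make the task described in the abstract concrete, the following is a minimal sketch of a bounded iterative loop that checks a proposed construction against required objects and verifiable constraints. It is an illustration only: the abstract does not specify the benchmark's DSL or its verifier, so the point-dictionary representation, the constraint names (`equal_length`, `perpendicular`), and the functions `check_constraints` and `build_with_feedback` are all hypothetical.

```python
import math

# Hypothetical sketch: a "construction" here is a dict mapping point names
# to (x, y) coordinates. This is NOT the GeoBuildBench DSL or verifier.


def dist(p, q):
    """Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def check_constraints(points, required_objects, constraints, tol=1e-6):
    """Return a list of failure messages; an empty list means success."""
    failures = [f"missing object: {name}"
                for name in required_objects if name not in points]
    for kind, names in constraints:
        # Skip constraints whose points are absent; report them instead.
        if any(n not in points for n in names):
            failures.append(f"cannot check {kind}{names}: undefined point")
            continue
        a, b, c, d = (points[n] for n in names)
        if kind == "equal_length":
            # Segment ab must have the same length as segment cd.
            if abs(dist(a, b) - dist(c, d)) > tol:
                failures.append(f"equal_length{names} violated")
        elif kind == "perpendicular":
            # Direction vectors of ab and cd must have zero dot product.
            dot = (b[0] - a[0]) * (d[0] - c[0]) + (b[1] - a[1]) * (d[1] - c[1])
            if abs(dot) > tol:
                failures.append(f"perpendicular{names} violated")
        else:
            failures.append(f"unknown constraint kind: {kind}")
    return failures


def build_with_feedback(propose, required_objects, constraints, max_rounds=3):
    """Call propose(feedback) up to max_rounds times.

    Returns (points, rounds_used) on success, or (None, max_rounds) if no
    attempt satisfies every constraint within the budget.
    """
    feedback = []
    for round_index in range(1, max_rounds + 1):
        points = propose(feedback)
        feedback = check_constraints(points, required_objects, constraints)
        if not feedback:
            return points, round_index
    return None, max_rounds
```

In this sketch, `propose` stands in for the model: it receives the previous round's failure messages and returns a new attempt. A real harness would execute the generated DSL program to obtain the diagram rather than reading coordinates directly.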
