TL;DR
GeoBuildBench is a new benchmark for testing large language models' ability to generate executable geometric constructions from natural language problems, emphasizing interactive reasoning and constraint satisfaction.
Contribution
It introduces a novel interactive geometry construction benchmark with 489 problems, enabling evaluation of models' grounded reasoning and self-correction capabilities.
Findings
Models often hallucinate structures and miss objects in generated diagrams.
Models struggle to satisfy geometric constraints despite visual and feedback cues.
GeoBuildBench provides a rigorous testbed for grounded, executable reasoning in geometry.
Abstract
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram construction as an interactive task: given a textual problem, an agent must generate a domain-specific language (DSL) program that produces a diagram containing the explicitly specified geometric objects and satisfying verifiable constraints. The benchmark features 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates,…
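To make the evaluation setup in the abstract concrete, here is a minimal Python sketch of a bounded iterative loop that checks a constructed diagram for required objects and verifiable constraints, then feeds failures back to the agent. Everything in it is an assumption for illustration: the benchmark's actual DSL, constraint vocabulary, and feedback format are not described here, so the diagram is reduced to named points, the two constraint kinds (`equal_length`, `perpendicular`) are hypothetical, and `toy_agent` stands in for a model whose DSL program has already been executed.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Diagram = Dict[str, Point]                  # point name -> coordinates
Constraint = Tuple[str, Tuple[str, ...]]    # (kind, point names), hypothetical format


def check(diagram: Diagram, required: Sequence[str],
          constraints: Sequence[Constraint], tol: float = 1e-6) -> List[str]:
    """Return failure messages; an empty list means the diagram passes."""
    failures = [f"missing point {p}" for p in required if p not in diagram]
    for kind, names in constraints:
        if any(n not in diagram for n in names):
            failures.append(f"cannot check {kind}{names}: undefined point")
            continue
        a, b, c, d = (diagram[n] for n in names)
        if kind == "equal_length":          # |AB| == |CD|
            ok = abs(math.dist(a, b) - math.dist(c, d)) <= tol
        elif kind == "perpendicular":       # AB . CD == 0
            dot = (b[0] - a[0]) * (d[0] - c[0]) + (b[1] - a[1]) * (d[1] - c[1])
            ok = abs(dot) <= tol
        else:
            raise ValueError(f"unknown constraint kind: {kind}")
        if not ok:
            failures.append(f"violated {kind}{names}")
    return failures


def solve(agent: Callable[[List[str]], Diagram], required: Sequence[str],
          constraints: Sequence[Constraint], max_rounds: int = 3) -> Optional[int]:
    """Run up to max_rounds attempts; return the round that passed, else None."""
    feedback: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        diagram = agent(feedback)           # agent sees the previous failures
        feedback = check(diagram, required, constraints)
        if not feedback:
            return round_idx
    return None


def toy_agent(feedback: List[str]) -> Diagram:
    """Stand-in for a model: wrong on the first try, corrected after feedback."""
    if not feedback:
        return {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (1.0, 3.0)}
    return {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0)}


right_angle_at_A = [("perpendicular", ("A", "B", "A", "C"))]
rounds_needed = solve(toy_agent, ["A", "B", "C"], right_angle_at_A)
```

In this toy run the first diagram violates the perpendicularity constraint and the second satisfies it, so `rounds_needed` is 2. A missing required point is reported the same way as a violated constraint, which mirrors the two failure modes listed under Findings.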
