DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

Hyunjun Kim; Sooyoung Ryu

arXiv:2512.01174 · cs.CL · December 2, 2025

Open Access

TL;DR

DrawingBench is a transparent, rule-based evaluation framework for assessing the spatial reasoning and GUI interaction capabilities of large language models through mouse-based drawing tasks, emphasizing auditability and trustworthiness.

Contribution

We introduce DrawingBench, a novel, open-source framework that enables transparent, reproducible evaluation of LLMs' spatial reasoning and GUI interaction skills with objective criteria and external oversight.

Findings

01

Models achieved 92.8% perfect performance with feedback.

02

Explicit, verifiable criteria led to 100% accuracy.

03

External oversight improved model performance significantly.

Abstract

As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four…
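
As a rough illustration of what deterministic, rule-based scoring of a low-level GUI action sequence could look like, here is a minimal Python sketch. The action names (`moveTo`, `mouseDown`, `mouseUp`), the 800×600 canvas, and the three rules are assumptions made for this example; they are not the eight criteria or the action format that DrawingBench defines.

```python
from dataclasses import dataclass

# Hypothetical canvas size, assumed for illustration only.
CANVAS_W, CANVAS_H = 800, 600


@dataclass(frozen=True)
class Action:
    kind: str   # assumed vocabulary: "moveTo", "mouseDown", "mouseUp"
    x: int = 0
    y: int = 0


def in_bounds(actions):
    """Every action's coordinates lie inside the canvas."""
    return all(0 <= a.x < CANVAS_W and 0 <= a.y < CANVAS_H for a in actions)


def balanced_presses(actions):
    """Each mouseDown is closed by a mouseUp before the next mouseDown."""
    down = False
    for a in actions:
        if a.kind == "mouseDown":
            if down:
                return False
            down = True
        elif a.kind == "mouseUp":
            if not down:
                return False
            down = False
    return not down


def draws_something(actions):
    """At least one moveTo happens while the button is held."""
    down = False
    for a in actions:
        if a.kind == "mouseDown":
            down = True
        elif a.kind == "mouseUp":
            down = False
        elif a.kind == "moveTo" and down:
            return True
    return False


# Illustrative rule set; each rule is a pure function, so scoring is reproducible.
RULES = {
    "in_bounds": in_bounds,
    "balanced_presses": balanced_presses,
    "draws_something": draws_something,
}


def evaluate(actions):
    """Return (fraction of rules passed, per-rule pass/fail results)."""
    results = {name: rule(actions) for name, rule in RULES.items()}
    return sum(results.values()) / len(results), results


def failed_rules(results):
    """Names of failed rules, usable as feedback for a follow-up turn."""
    return [name for name, ok in results.items() if not ok]


good = [Action("moveTo", 100, 100), Action("mouseDown", 100, 100),
        Action("moveTo", 200, 100), Action("mouseUp", 200, 100)]
bad = [Action("mouseDown", 10, 10), Action("moveTo", 50, 50)]

good_score, good_results = evaluate(good)
bad_score, bad_results = evaluate(bad)
```

Because every rule is a pure function of the action list, the same sequence always receives the same score, and the per-rule results show which check failed. The output of `failed_rules` is one way such results could be returned to a model as feedback in a later turn.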

Taxonomy

Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI)