DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks
Hyunjun Kim, Sooyoung Ryu

TL;DR
DrawingBench is a transparent, rule-based evaluation framework for assessing the spatial reasoning and GUI interaction capabilities of large language models through mouse-based drawing tasks, emphasizing auditability and trustworthiness.
Contribution
We introduce DrawingBench, a novel, open-source framework that enables transparent, reproducible evaluation of LLMs' spatial reasoning and GUI interaction skills with objective criteria and external oversight.
Findings
Models achieved 92.8% perfect performance with feedback.
Explicit, verifiable criteria led to 100% accuracy.
External oversight improved model performance significantly.
Abstract
As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four…
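As a rough illustration of what transparent, rule-based scoring of a low-level GUI action sequence might look like, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the action schema (`moveTo`, `mouseDown`, `mouseUp`), the canvas size, the three example checks, and the function names are hypothetical and are not DrawingBench's actual action format or its 8 criteria.

```python
from typing import Callable, Dict, List

# Hypothetical canvas size and action schema (assumptions, not from the paper).
CANVAS_W, CANVAS_H = 1000, 700
Action = Dict[str, object]


def in_bounds(actions: List[Action]) -> bool:
    """Every moveTo coordinate lies inside the canvas."""
    return all(
        0 <= a["x"] <= CANVAS_W and 0 <= a["y"] <= CANVAS_H
        for a in actions
        if a["type"] == "moveTo"
    )


def balanced_presses(actions: List[Action]) -> bool:
    """mouseDown and mouseUp alternate, and the button ends released."""
    down = False
    for a in actions:
        if a["type"] == "mouseDown":
            if down:
                return False
            down = True
        elif a["type"] == "mouseUp":
            if not down:
                return False
            down = False
    return not down


def draws_something(actions: List[Action]) -> bool:
    """At least one moveTo happens while the button is held."""
    down = False
    for a in actions:
        if a["type"] == "mouseDown":
            down = True
        elif a["type"] == "mouseUp":
            down = False
        elif a["type"] == "moveTo" and down:
            return True
    return False


# Illustrative rule set; each rule is a deterministic predicate.
CRITERIA: Dict[str, Callable[[List[Action]], bool]] = {
    "in_bounds": in_bounds,
    "balanced_presses": balanced_presses,
    "draws_something": draws_something,
}


def evaluate(actions: List[Action]) -> Dict[str, object]:
    """Run every rule and report per-rule results plus the pass fraction."""
    results = {name: check(actions) for name, check in CRITERIA.items()}
    score = sum(results.values()) / len(results)
    return {"results": results, "score": score}


if __name__ == "__main__":
    line = [
        {"type": "moveTo", "x": 100, "y": 100},
        {"type": "mouseDown"},
        {"type": "moveTo", "x": 300, "y": 100},
        {"type": "mouseUp"},
    ]
    print(evaluate(line))
```

Because each rule is a plain predicate over the action list, the per-rule results can be reported alongside the score, which is one way an auditor could see which check failed and feed that back to the model in a later turn.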
Taxonomy
Topics: Ethics and Social Impacts of AI · Artificial Intelligence in Games · Explainable Artificial Intelligence (XAI)
