LTD-Bench: Evaluating Large Language Models by Letting Them Draw

Liuhao Lin; Ke Li; Zihan Xu; Yuchen Shi; Yulei Qin; Yan Zhang; Xing Sun; Rongrong Ji

arXiv:2511.02347·cs.CL·November 5, 2025

TL;DR

LTD-Bench evaluates large language models by requiring them to produce drawings, either as dot matrices or as executable code. The resulting images expose spatial reasoning limitations that numerical metrics conceal and give an intuitive view of model capabilities.

Contribution

This paper presents LTD-Bench, a benchmark that evaluates LLMs through visual output tasks, bridging the gap between statistical performance and an intuitive assessment of spatial reasoning.

Findings

01. State-of-the-art models show significant spatial reasoning deficiencies.

02. Traditional benchmarks do not reveal models' limitations in spatial understanding.

03. LTD-Bench enables diagnostic analysis of model capabilities.

Abstract

Current evaluation paradigms for large language models (LLMs) represent a critical blind spot in AI research, relying on opaque numerical metrics that conceal fundamental limitations in spatial reasoning while providing no intuitive understanding of model capabilities. This deficiency creates a dangerous disconnect between reported performance and practical abilities, particularly for applications requiring physical world understanding. We introduce LTD-Bench, a breakthrough benchmark that transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. This approach makes spatial reasoning limitations immediately apparent even to non-experts, bridging the fundamental gap between statistical performance and intuitive assessment. LTD-Bench implements a comprehensive methodology with…
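
The abstract does not specify the dot-matrix format that models are asked to produce. As an illustration only, the sketch below assumes a response made of equal-length rows of `0` and `1` characters and shows how such a response could be turned into something a person can inspect at a glance. The function names `parse_dot_matrix` and `render`, the sample response, and the `#`/`.` glyphs are all hypothetical and are not taken from the paper.

```python
def parse_dot_matrix(text: str) -> list[list[int]]:
    """Parse rows of '0'/'1' characters into a rectangular grid of ints.

    Assumed format (not from the paper): one row per line, whitespace ignored.
    """
    grid = []
    for line in text.strip().splitlines():
        cells = [ch for ch in line if not ch.isspace()]
        if not cells:
            continue  # skip blank lines inside the response
        if any(ch not in "01" for ch in cells):
            raise ValueError(f"unexpected character in row: {line!r}")
        grid.append([int(ch) for ch in cells])
    if not grid:
        raise ValueError("empty dot matrix")
    width = len(grid[0])
    if any(len(row) != width for row in grid):
        raise ValueError("rows must all have the same length")
    return grid


def render(grid: list[list[int]], on: str = "#", off: str = ".") -> str:
    """Render the grid as text so the drawn shape is directly visible."""
    return "\n".join("".join(on if cell else off for cell in row) for row in grid)


# Hypothetical model response to a prompt such as "draw a circle on a 5x5 grid".
model_output = """
01110
10001
10001
10001
01110
"""

grid = parse_dot_matrix(model_output)
print(render(grid))
```

Rendering the response, rather than scoring it numerically, is the point of the illustration: a malformed or spatially wrong shape is obvious from the printed grid without any metric.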

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)