Can Large Language Models Understand Symbolic Graphics Programs?
Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf

TL;DR
This paper introduces a benchmark to evaluate large language models' ability to understand and reason about symbolic graphics programs, revealing their strengths and limitations in semantic visual understanding.
Contribution
It proposes a new benchmark for semantic understanding of symbolic graphics programs and introduces Symbolic Instruction Tuning to enhance LLM reasoning capabilities.
Findings
LLMs show improved reasoning on symbolic graphics with SIT.
Transformations that preserve semantics challenge LLM understanding.
SIT enhances general reasoning beyond symbolic programs.
Abstract
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to "imagine" and reason how the corresponding graphics…
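To make the evaluation setting in the abstract concrete, here is a minimal sketch of how a semantic question about a symbolic graphics program might be posed to a text-only model. The prompt wording, the `SGPQuestion` fields, and the helper names are hypothetical and chosen for illustration; they are not the benchmark's actual format. The point is only that the model sees the program source and never a rendered image.

```python
from dataclasses import dataclass


@dataclass
class SGPQuestion:
    """One multiple-choice item (hypothetical schema, not the benchmark's)."""
    program: str        # symbolic graphics program, e.g. SVG source
    question: str       # semantic question about the rendered result
    options: list[str]  # candidate answers
    answer_index: int   # index of the correct option


LETTERS = "ABCD"


def build_prompt(item: SGPQuestion) -> str:
    """Assemble a text-only prompt: program source plus lettered options."""
    lines = [
        "Examine the following SVG code and answer the question.",
        "",
        item.program,
        "",
        f"Question: {item.question}",
    ]
    for letter, option in zip(LETTERS, item.options):
        lines.append(f"{letter}) {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def is_correct(item: SGPQuestion, reply: str) -> bool:
    """Score a reply by comparing its first character to the answer letter."""
    reply = reply.strip().upper()
    return bool(reply) and reply[0] == LETTERS[item.answer_index]


item = SGPQuestion(
    program='<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
            '<circle cx="50" cy="50" r="20" fill="red"/></svg>',
    question="Which shape is drawn?",
    options=["Square", "Circle", "Triangle", "Star"],
    answer_index=1,
)
print(build_prompt(item))
```

A toy item like this one can be answered by matching the `circle` tag name, which is exactly the kind of shortcut a reviewer below raises for the "color" and "shape" categories; questions about what a composition of paths depicts do not offer that shortcut.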
Peer Reviews
Decision·ICLR 2025 Spotlight
1. Innovative in presenting a benchmark and dataset that adds value to multi-modal LLM and vision foundation model research.
2. Extensive experimentation with well-designed elements, particularly in the fine-grained categorization of SVG-Understanding. While categories like "color" and "shape" might be addressed through subpattern matching in SVG codes, "Semantics" and "Reasoning" better assess true reasoning capability.
3. Clear presentation and valuable insights.
1. The SGP-MNIST experiments show limited spatial reasoning in LLMs, suggesting that some claims about models’ spatial reasoning or “visual imagery” abilities may be overstated.
2. The conclusion is somewhat underwhelming, reiterating known principles such as scaling laws and fine-tuning effects.
1. The benchmark dataset is large, with 4,340 questions in the SVG set and 2,400 in the CAD set.
2. Both 2D and 3D objects are evaluated.
3. The paper addresses data leakage issues by utilizing global transformations like translations and rotations.
4. The proposed instruction tuning dataset improves general reasoning ability in LLMs.
5. The paper provides a critical view of current LLMs' capabilities.
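The leakage control mentioned above, applying global translations and rotations so that a program no longer matches text seen in training while depicting the same object, can be sketched as follows. This is an illustrative assumption about one way to do it: the page does not say whether the authors wrap the content in a group or rewrite coordinates directly, and `wrap_with_transform` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def wrap_with_transform(svg_source: str, dx, dy, angle_deg) -> str:
    """Move all top-level children of an SVG into one <g> that applies
    a global translation followed by a rotation about the origin."""
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_source)

    group = ET.Element(f"{{{SVG_NS}}}g")
    group.set("transform", f"translate({dx} {dy}) rotate({angle_deg})")

    # Copy the child list first, since we mutate root while iterating.
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    return ET.tostring(root, encoding="unicode")


original = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="50" r="20" fill="red"/></svg>'
)
print(wrap_with_transform(original, 10, 5, 30))
```

The depicted object is unchanged, but the program text differs from the original. Note that a large offset or rotation can push content outside the viewport, so a real pipeline would presumably bound the parameters or adjust the `viewBox`.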
1. The human study reports 90% labeling accuracy, leaving 10% mislabeled data, which could significantly impact the results given the limited variation in accuracy across the evaluated LLMs. I suggest human filtering of the whole benchmark to improve the accuracy of the ground-truth answers.
2. The method for generating questions on the 3D dataset from limited rendered views may lead to errors, especially in counting tasks; for example, the second part of Figure 7 (row 3, column 3) is confusing. T
- SGPs offer an intriguing middle ground between perception and reasoning. While humans can intuitively link rendered images to semantic concepts, associating these same concepts with the underlying SVG schema may demand additional time and coding knowledge.
- The dataset presented is highly scalable, as millions of SVG images and CAD programs with permissive licenses are available online for use in this benchmark.
- The proposed experiments are both well-motivated and effectively executed.
- Th
* L108: "highly scalable benchmark." While I agree that SGPs have this potential, the paper does not demonstrate specific contributions toward achieving scalability. For example, benchmarks like [MineDojo](https://minedojo.org/) achieve high scalability by developing tools to efficiently scrape relevant data from platforms like YouTube, Reddit, and Wikia, resulting in a dataset of around 700k Minecraft videos and 6M+ comments. In contrast, the SGP dataset in this paper only contains ~5,000 data
Taxonomy
Topics: Model-Driven Software Engineering Techniques
