Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures
Yu He, Yingxi Li, Colin White, Ellen Vitercik

TL;DR
This paper introduces DSR-Bench, a comprehensive benchmark using data structures to evaluate the structural reasoning abilities of large language models, revealing significant limitations in their algorithmic reasoning skills.
Contribution
The paper presents DSR-Bench, a diagnostic benchmark with fully automated generation and evaluation that assesses how well LLMs understand and reason over data structures.
Findings
The top-performing model scores only 0.46/1 on challenging instances.
Models perform poorly on spatial and context-rich data.
Models struggle to reason over their own code.
Abstract
Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating this capability. We propose data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning, the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench, spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three…
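To make "fully automated generation and evaluation" concrete, here is a minimal sketch of what one such task could look like for a queue. It is an illustration only, not DSR-Bench's actual code: the function names `generate_queue_task` and `score`, the prompt wording, the 0.4 dequeue probability, and the exact-match scoring rule are all assumptions introduced for this example.

```python
import random
from collections import deque


def generate_queue_task(num_ops, seed):
    """Build one hypothetical queue task: a prompt plus its ground-truth final state."""
    rng = random.Random(seed)  # seeded so the same task can be regenerated
    queue = deque()
    ops = []
    for _ in range(num_ops):
        # Dequeue only when the queue is non-empty, so every operation is valid.
        if queue and rng.random() < 0.4:
            queue.popleft()
            ops.append("dequeue()")
        else:
            value = rng.randint(0, 99)
            queue.append(value)
            ops.append(f"enqueue({value})")
    prompt = (
        "Start with an empty queue and apply these operations in order: "
        + ", ".join(ops)
        + ". Give the final queue from front to back as a Python list."
    )
    return prompt, list(queue)


def score(model_answer, expected):
    """Assumed exact-match rule: 1.0 if the parsed list equals the ground truth, else 0.0."""
    try:
        body = model_answer.strip().strip("[]")
        parsed = [int(tok) for tok in body.split(",") if tok.strip()]
    except ValueError:
        # Anything that is not a comma-separated list of integers counts as wrong.
        return 0.0
    return 1.0 if parsed == expected else 0.0
```

Under these assumptions, averaging `score` over many generated instances would yield a number between 0 and 1, the same scale as the 0.46/1 figure quoted above. Because the ground truth comes from actually running the operations, no human labeling is needed.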
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
