Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
Shlok Shelat, Jay Raval, Souvik Roy, Manas Gaur

TL;DR
This paper evaluates large language models' ability to perform formal reasoning on unseen automata construction tasks, revealing significant gaps in understanding despite high performance on familiar problems.
Contribution
It introduces a new benchmark for DFA construction from regular languages, highlighting the limitations of LLMs in generalizing reasoning to unseen, complex problems.
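To make the task concrete, here is a minimal sketch of what a DFA construction problem with two interacting constraints might look like, together with a bounded brute-force check of a candidate answer. The language, state names, and helper functions below are hypothetical illustrations and are not taken from the paper's benchmark. Agreement on all strings up to a fixed length is evidence of correctness, not a proof of equivalence.

```python
from itertools import product

# Hypothetical task (an assumption, not from the paper's benchmark):
# L = binary strings with an even number of 0s that end in 1.
ALPHABET = "01"
START = "even_no1"          # even count of 0s so far, last symbol is not 1
ACCEPT = {"even_1"}         # even count of 0s, last symbol is 1

# Candidate DFA: each state tracks (parity of 0s, whether last symbol was 1).
DELTA = {
    ("even_no1", "0"): "odd_no1",
    ("even_no1", "1"): "even_1",
    ("even_1", "0"): "odd_no1",
    ("even_1", "1"): "even_1",
    ("odd_no1", "0"): "even_no1",
    ("odd_no1", "1"): "odd_1",
    ("odd_1", "0"): "even_no1",
    ("odd_1", "1"): "odd_1",
}


def accepts(word):
    """Run the candidate DFA on word and report whether it ends in an accepting state."""
    state = START
    for ch in word:
        state = DELTA[(state, ch)]
    return state in ACCEPT


def in_language(word):
    """Reference predicate: the language specification stated directly."""
    return word.count("0") % 2 == 0 and word.endswith("1")


def first_mismatch(max_len=8):
    """Return the shortest string on which DFA and specification disagree, or None."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            word = "".join(chars)
            if accepts(word) != in_language(word):
                return word
    return None
```

Calling `first_mismatch()` on a model-produced DFA would surface the shortest counterexample, which is one simple way to detect a misread constraint.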
Findings
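Models achieve perfect accuracy on factual questions and 84-90% on seen construction tasks.
Accuracy drops sharply on unseen problems (by 30-64%).
Failures stem from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency.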
Abstract
Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint…
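For readers unfamiliar with Arden's theorem, the standard statement and a small worked example are sketched below. The excerpt does not describe how the paper uses the theorem to generate problems, so the two-state automaton here is only a hypothetical illustration of the theorem itself.

```latex
% Arden's theorem (standard form): if the language of P does not contain
% the empty string, then the equation
%     R = Q + RP
% has the unique solution
%     R = QP^{*}.
%
% Hypothetical example: a DFA over {a, b} with start state q_1 and
% accepting state q_2, where
%     q_1 --b--> q_1,  q_1 --a--> q_2,  q_2 --a--> q_2,  q_2 --b--> q_1.
% Writing one equation per state from its incoming transitions:
\begin{align*}
q_1 &= \varepsilon + q_1 b + q_2 b, \\
q_2 &= q_1 a + q_2 a.
\end{align*}
% Apply the theorem to the second equation (Q = q_1 a, P = a):
\begin{align*}
q_2 &= q_1 a a^{*}.
\end{align*}
% Substitute into the first equation and apply the theorem again
% (Q = \varepsilon, P = b + a a^{*} b):
\begin{align*}
q_1 &= \varepsilon + q_1 \,(b + a a^{*} b)
     \;\Longrightarrow\; q_1 = (b + a a^{*} b)^{*}, \\
q_2 &= (b + a a^{*} b)^{*}\, a a^{*}.
\end{align*}
```

The expression for the accepting state, `(b + aa*b)* aa*`, denotes the strings over {a, b} that end in `a`, which matches what the automaton accepts.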
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
