Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
Yihong Dong, Jianha Xiao, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, and Ge Li

TL;DR
This paper introduces ChomskyBench, a comprehensive benchmark to evaluate large language models' formal reasoning abilities across the Chomsky Hierarchy, revealing performance limitations and efficiency barriers.
Contribution
It is the first to systematically assess LLMs' formal reasoning using the full Chomsky Hierarchy with process-trace evaluation and symbolic verification.
Findings
Performance stratifies by hierarchy level: accuracy falls as task difficulty increases.
Larger models and advanced inference methods improve performance, but at high computational cost.
LLMs are less efficient than traditional algorithms for formal language recognition tasks.
Abstract
The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed…
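To make the idea of language recognition with deterministic verification concrete, here is a minimal Python sketch. It is not taken from ChomskyBench: the three example languages, the function names, and the `verify` helper are all assumptions chosen for illustration. Each recognizer is a conventional algorithm that runs in time linear in the input length, which is the kind of baseline the efficiency finding compares LLMs against.

```python
import re

# Hypothetical example languages, one per level; not the benchmark's actual tasks.

def is_ab_star(s: str) -> bool:
    """Regular (Type-3) example: the language (ab)*."""
    return re.fullmatch(r"(ab)*", s) is not None

def is_anbn(s: str) -> bool:
    """Context-free (Type-2) example: a^n b^n for n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def is_anbncn(s: str) -> bool:
    """Context-sensitive (Type-1) example: a^n b^n c^n for n >= 0."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

def verify(claimed: bool, s: str, recognizer) -> bool:
    """Deterministically check a claimed accept/reject answer against a recognizer."""
    return claimed == recognizer(s)

# Usage: check a (hypothetical) model answer for the string "aabbcc".
model_says_member = True
print(verify(model_says_member, "aabbcc", is_anbncn))
```

Because the check is an exact symbolic comparison, a claimed answer is either confirmed or refuted with no judgment call, which is the property the abstract refers to as deterministic symbolic verifiability.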
