Diagnosing CFG Interpretation in LLMs
Hanqi Li, Lu Chen, Kai Yu

TL;DR
This paper evaluates large language models' ability to interpret context-free grammars, revealing their limitations in maintaining structural semantics under complex, recursive, and high-density conditions.
Contribution
It introduces RoboGrid, a framework for stress-testing LLMs' syntactic, behavioral, and semantic capabilities in grammar interpretation tasks.
Findings
LLMs often preserve surface syntax but struggle with structural semantics.
Performance drops significantly with increased recursion depth and branching.
LLMs rely on familiar keywords for meaning rather than on symbolic induction.
Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on…
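To make the syntax level of this evaluation concrete, here is a minimal sketch of a validity check for generated output, which also reports recursion depth. The toy grammar (`move`, `left`, `right`, `repeat N { ... }`) and the names `parse` and `is_valid` are hypothetical illustrations; they are not RoboGrid's actual grammar or API, which this summary does not specify.

```python
# Hypothetical toy grammar (not RoboGrid's):
#   program := stmt*
#   stmt    := "move" | "left" | "right" | "repeat" NUMBER "{" program "}"

def parse(tokens):
    """Return the maximum `repeat` nesting depth, or raise ValueError."""
    pos = 0

    def block(depth):
        nonlocal pos
        deepest = depth
        # Consume statements until the block closes or the input ends.
        while pos < len(tokens) and tokens[pos] != "}":
            tok = tokens[pos]
            if tok in ("move", "left", "right"):
                pos += 1
            elif tok == "repeat":
                header_ok = (
                    pos + 2 < len(tokens)
                    and tokens[pos + 1].isdigit()
                    and tokens[pos + 2] == "{"
                )
                if not header_ok:
                    raise ValueError("malformed repeat header")
                pos += 3
                deepest = max(deepest, block(depth + 1))
                if pos >= len(tokens) or tokens[pos] != "}":
                    raise ValueError("missing closing brace")
                pos += 1
            else:
                raise ValueError(f"unknown token: {tok}")
        return deepest

    deepest = block(0)
    if pos != len(tokens):
        raise ValueError("unmatched closing brace")
    return deepest


def is_valid(source):
    """Syntax-only check: True if the whitespace-tokenized source parses."""
    try:
        parse(source.split())
        return True
    except ValueError:
        return False


nested = "repeat 2 { move repeat 3 { left } }"
print(is_valid(nested), parse(nested.split()))
```

A check like this covers only the syntactic level. A program can pass it and still drive the robot to the wrong cell, which is the gap between surface syntax and structural semantics that the abstract describes.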
