Context-Free Recognition with Transformers
Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

TL;DR
This paper shows that transformers with looped layers can recognize all context-free languages, and that recognition is more efficient for subclasses such as unambiguous CFLs. How practical recognition is depends on the complexity of the language.
Contribution
It proves that looped transformers can recognize all CFLs with sufficient layers and padding, and identifies more efficient recognition for natural subclasses like unambiguous CFLs.
Findings
Looped transformers with O(log n) looping layers and padding tokens can recognize all CFLs.
Unambiguous CFLs, a natural subclass, can be recognized more efficiently than CFLs in general.
Empirical validation shows looping improves language recognition capabilities.
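For orientation, the recognition problem behind these findings can be sketched with the classical CYK dynamic program. This is a minimal illustration of what "recognizing a CFL" means, not the paper's looped-transformer construction. The function name `cyk_recognize` and the example grammar for { a^n b^n : n >= 1 } are assumptions made for this sketch, and the grammar is assumed to be in Chomsky normal form.

```python
def cyk_recognize(word, terminal_rules, binary_rules, start="S"):
    """Return True if `word` is derivable from `start` in a CNF grammar.

    terminal_rules: list of (A, a) pairs meaning A -> a
    binary_rules:   list of (A, B, C) triples meaning A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # this sketch does not handle the empty string

    # table[i][j] holds the nonterminals that derive word[i : j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]

    # Base case: spans of length 1 come from terminal rules.
    for i, ch in enumerate(word):
        table[i][i] = {A for (A, a) in terminal_rules if a == ch}

    # Inductive case: combine two adjacent shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two spans
                left, right = table[i][k], table[k + 1][j]
                for (A, B, C) in binary_rules:
                    if B in left and C in right:
                        table[i][j].add(A)

    return start in table[0][n - 1]


# Hypothetical CNF grammar for the unambiguous CFL { a^n b^n : n >= 1 }:
#   S -> A B | A T,  T -> S B,  A -> a,  B -> b
TERMINAL_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")]

if __name__ == "__main__":
    for w in ["ab", "aaabbb", "aab", "abab"]:
        print(w, cyk_recognize(w, TERMINAL_RULES, BINARY_RULES))
```

The table is filled sequentially by span length, so this loop takes a number of rounds linear in n. The findings above concern achieving recognition within O(log n) looped layers instead.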
Abstract
Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work proves that $\log n$ looping layers (w.r.t. input length $n$) allows transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $O(\log n)$ looping layers and $O(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $O(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural…
Peer Reviews
Decision · Submitted to ICLR 2026
The paper establishes a strong theoretical connection between CFL recognition and model size, which may provide useful insights for designing more efficient models.
1. The analysis focuses solely on model expressiveness and does not address learnability. 2. Although $O(n^3)$ padding is an improvement over $O(n^6)$, it may still be impractical in real-world settings. 3. The experimental section does not appear to relate clearly to the theoretical results, particularly concerning the number of padding tokens.
* The paper is very well written. I found the theoretical exposition to be quite clear. The authors' approach of constructing the algorithms and conducting their analysis before discussing the relation to, and implementation with, Transformers was very clear and pleasant to read. * The theoretical results are novel and interesting. As the authors have also stated, this work is the first theoretical contribution (to the best of my knowledge as well) which proves CFL recognition with transformers. * The the
- The biggest weakness for me is the experiments section. I feel a lot of methodological details are missing from this section. See the "Questions" section below. I feel an appendix section with more experimental details could be beneficial to readers who want to better understand or reproduce similar setups. - I find the model of Transformers presented by the authors to be unrealistic and quite detached from practice. I have no problem with the use of hard attention however: - Padding tokens are
The circuit and computational complexity of CFLs is an interesting topic, and useful connections between transformers and circuit complexity have recently been established. It is not known whether CFL is in NL, and it seems unlikely that CFLs are in NC^1.
The proof of the main result is a bit sloppy, and I could not understand the details (and could not restore them myself). The idea in short is that there are some guessing rules that make it possible to check in O(log n) recursive steps whether a given non-terminal derives a word of length n. More precisely, there are polynomially many "derivation statements" ("items" and "slashed items"). Each derivation statement is either an easily checkable base case or can be derived from 2 other derivation sta
Taxonomy
Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · DNA and Biological Computing
