Context-Free Recognition with Transformers

Selim Jerad; Anej Svete; Sophie Hao; Ryan Cotterell; William Merrill

arXiv:2601.01754·cs.LG·February 9, 2026

Open Access 3 Reviews

TL;DR

This paper shows that looped transformers can recognize all context-free languages (CFLs) and that natural subclasses such as unambiguous CFLs admit more efficient recognition. The general construction relies on a large number of padding tokens, so its practicality depends on the complexity of the language.

Contribution

It proves that looped transformers can recognize all CFLs with sufficient layers and padding, and identifies more efficient recognition for natural subclasses like unambiguous CFLs.
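As a point of reference for what "recognizing a CFL" means, here is a minimal sketch of the classical CYK dynamic-programming recognizer. It is not the paper's transformer construction; it only illustrates the decision problem. The grammar for { a^n b^n : n >= 1 } and the names `UNARY`, `BINARY`, and `cyk_recognize` are hypothetical, chosen for illustration.

```python
from itertools import product

# Hypothetical grammar in Chomsky normal form for { a^n b^n : n >= 1 }:
#   S -> A B | A T,   T -> S B,   A -> 'a',   B -> 'b'
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}


def cyk_recognize(word: str, start: str = "S") -> bool:
    """Return True iff `word` is derivable from `start` (standard CYK)."""
    n = len(word)
    if n == 0:
        return False  # this grammar has no rule deriving the empty string
    # table[i][j] holds the nonterminals that derive word[i : j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = set(UNARY.get(ch, ()))
    for span in range(2, n + 1):              # substring length
        for i in range(n - span + 1):         # substring start
            j = i + span - 1                  # substring end (inclusive)
            for k in range(i, j):             # split point
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n - 1]


print(cyk_recognize("aabb"), cyk_recognize("abab"))
```

The table is filled span by span, so this sequential version takes a number of rounds linear in the input length; the paper's result concerns achieving recognition with only $O(\log n)$ looping layers.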

Findings

01

Looped transformers with O(log n) looping layers and O(n^6) padding tokens can recognize all CFLs.

02

Natural subclasses such as unambiguous CFLs admit more efficient recognition than the general construction.

03

Empirical validation shows looping improves language recognition capabilities.

Abstract

Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work proves that $O(\log n)$ looping layers (w.r.t. input length $n$) allow transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $O(\log n)$ looping layers and $O(n^{6})$ padding tokens can recognize all CFLs. However, training and inference with $O(n^{6})$ padding tokens is potentially impractical. Fortunately, we show that, for natural…
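To make the scale of these bounds concrete, the following sketch tabulates the looping depth and padding counts for a few input lengths. It assumes every hidden constant in the asymptotic bounds is 1 and uses a base-2 logarithm, neither of which the abstract specifies. The $n^3$ column reflects the improved padding figure quoted in Reviewer 01's comments; the function name `resource_budget` is hypothetical.

```python
import math


def resource_budget(n: int) -> dict:
    """Illustrative budgets for input length n, with all constants assumed to be 1."""
    return {
        "n": n,
        # O(log n) looping layers; base 2 is an assumption
        "loop_iterations": max(1, math.ceil(math.log2(n))),
        # O(n^6) padding tokens for general CFLs (from the abstract)
        "padding_general": n ** 6,
        # O(n^3) padding tokens (figure quoted in Reviewer 01's comments)
        "padding_cubic": n ** 3,
    }


for n in (8, 64, 512):
    print(resource_budget(n))
```

Even at n = 64, the general bound already calls for tens of billions of padding tokens under these assumptions, which is the impracticality the abstract points to.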

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper establishes a strong theoretical connection between CFL recognition and model size, which may provide useful insights for designing more efficient models.

Weaknesses

1. The analysis focuses solely on model expressiveness and does not address learnability.
2. Although $O(n^3)$ padding is an improvement over $O(n^6)$, it may still be impractical in real-world settings.
3. The experimental section does not appear to relate clearly to the theoretical results, particularly concerning the number of padding tokens.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The paper is very well written. I found the theoretical exposition to be quite clear. The authors' approach of constructing the algorithms and conducting their analysis before discussing the relation to, and implementation with, transformers was very clear and pleasant to read.
* The theoretical results are novel and interesting. As the authors have also stated, this work is the first theoretical contribution (to the best of my knowledge as well) that proves CFL recognition with transformers.
* The the

Weaknesses

- The biggest weakness for me is the experiments section. I feel a lot of methodological details are missing from this section. See the "Questions" section below. I feel an appendix section with more experimental details could be beneficial to readers who want to better understand or reproduce similar setups.
- I find the model of transformers presented by the authors to be unrealistic and quite detached from practice. I have no problem with the use of hard attention however: - Padding tokens are

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The circuit and computational complexity of CFLs is an interesting topic, and transformers have recently been connected to circuit complexity in useful ways. It is not known whether CFLs are in NL, and it seems unlikely that they are in NC^1.

Weaknesses

The proof of the main result is a bit sloppy, and I could not understand the details (and could not restore them myself). The idea in short is that there are some guessing rules that allow one to check in O(log n) recursive steps whether a given non-terminal derives a word of length n. More precisely, there are polynomially many "derivation statements" ("items" and "slashed items"). Each derivation statement is either an easily checkable base case or can be derived from 2 other derivation sta

Taxonomy

Topics: Machine Learning and Algorithms · Natural Language Processing Techniques · DNA and Biological Computing