Bearing Syntactic Fruit with Stack-Augmented Neural Networks
Brian DuSell, Ryan Cotterell

TL;DR
This paper shows that stack-augmented neural networks can generalize in human-like, hierarchical ways without special training conditions (syntactic supervision, pre-training on massive corpora, or training long past convergence), advancing the modeling of language acquisition.
Contribution
It demonstrates for the first time that stack-augmented neural networks can generalize hierarchically without special training conditions, using transformer and RNN architectures with novel stack modifications.
Findings
Transformers with nondeterministic stacks outperform other architectures.
Stack-augmented RNNs show improved hierarchical generalization.
Results suggest these models better mimic human language learning.
Abstract
Any finite set of training data is consistent with an infinite number of hypothetical algorithms that could have generated it. Studies have shown that when human children learn language, they consistently favor hypotheses based on hierarchical syntactic rules without ever encountering disambiguating examples. A recent line of work has inquired as to whether common neural network architectures share this bias, finding that they do so only under special conditions: when syntactically supervised, when pre-trained on massive corpora, or when trained long past convergence. In this paper, we demonstrate, for the first time, neural network architectures that are able to generalize in human-like fashion without any of the aforementioned requirements: stack-augmented neural networks. We test three base architectures (transformer, simple RNN, LSTM) augmented with two styles of stack: the…
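The underdetermination described in the first two sentences of the abstract can be made concrete with a toy sketch. It uses English question formation, a standard example in this line of work; the sentences, the `AUXILIARIES` set, and the function names below are hypothetical illustrations, not the paper's datasets or models. Two rules agree on every training example yet diverge on a held-out sentence whose subject contains a relative clause.

```python
# Toy illustration only: hypothetical sentences and rules, not the paper's data.
# A sentence is (subject_tokens, main_auxiliary, predicate_tokens).

AUXILIARIES = {"can", "will"}

def flatten(subject, main_aux, predicate):
    """Return the declarative sentence as a flat list of words."""
    return subject + [main_aux] + predicate

def linear_rule(subject, main_aux, predicate):
    """Front the FIRST auxiliary in the word string (ignores structure)."""
    words = flatten(subject, main_aux, predicate)
    i = next(k for k, w in enumerate(words) if w in AUXILIARIES)
    return [words[i]] + words[:i] + words[i + 1:]

def hierarchical_rule(subject, main_aux, predicate):
    """Front the MAIN-CLAUSE auxiliary (uses the syntactic structure)."""
    return [main_aux] + subject + predicate

# Ambiguous training data: the first auxiliary is always the main one.
train = [
    (["the", "walrus"], "can", ["giggle"]),
    (["the", "newts"], "will", ["see", "the", "yak", "who", "can", "swim"]),
]

# Disambiguating example: the subject contains a relative clause
# with its own auxiliary, so the two rules give different questions.
held_out = (["the", "walrus", "who", "will", "swim"], "can", ["giggle"])

agree_on_train = all(linear_rule(*s) == hierarchical_rule(*s) for s in train)
linear_question = " ".join(linear_rule(*held_out))
hierarchical_question = " ".join(hierarchical_rule(*held_out))

print(agree_on_train)
print(linear_question)
print(hierarchical_question)
```

Both rules fit the training set perfectly, so the data alone cannot say which one a learner has acquired. Only the held-out sentence separates them, which is why evaluations of this kind test on examples absent from training.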
Peer Reviews
Decision · Submitted to ICLR 2026
- The research question, the hypothesis, and the takeaways are clear and straightforward to understand, partly owing to the fact that this problem, operationalized in terms of the specific datasets used, has a well-established cottage industry (again not in a negative sense at all, just for lack of a better term)
- The exposition of the scope of the contribution and the experimental setup is generally clear.
- I think "syntactically supervised" is a bit of a vague term for addressing what McCoy et al. 2020 did - my inference is that this refers to the tree-structured network experiments, but it would help to highlight the fact that training such networks requires explicit representations of syntactic structure in the training data. Since "syntactically supervised" is ambiguous between only data-level supervision and the architecture requiring parsed inputs, it might be expositionally clearer & make the new cont
This work rigorously sweeps over architectural parameterizations. The presentation of the existing methods is thorough.
The primary weakness of this work is that the empirical contribution is relatively small. The majority of the text is dedicated to explaining architectural details already described in related work, or in describing fairly small architectural innovations that do not yield empirical improvements (i.e., the +R reading shortcut still results in a negative LR). Many of these descriptions can be put in an appendix, with the main text expanded to include additional analyses and experiments. For exam
The paper tackles an issue at the intersection of machine learning and linguistic theory — whether explicit structural memory can induce human-like syntactic generalization under “poverty of the stimulus.” The motivation is articulated clearly and grounded in prior psycholinguistic work. The work effectively connects formal-language theory, ML architectures, and cognitive modeling. This synthesis gives the study conceptual depth beyond architecture comparisons. The narrative is well structured:
While the paper presents an elegant and well-executed study linking stack-augmented architectures with hierarchical generalization, several limitations remain. First, the experimental scope is narrow: all results rely on small synthetic grammars and two tasks within the poverty-of-stimulus framework, leaving unclear whether the observed effects generalize to naturalistic or cross-linguistic data. The paper lacks discussion of other diagnostic cases traditionally used to probe structure depend
Taxonomy
Topics: Language Development and Disorders · Neurobiology of Language and Bilingualism · Topic Modeling
