Bearing Syntactic Fruit with Stack-Augmented Neural Networks

Brian DuSell; Ryan Cotterell

arXiv:2511.03547 · cs.CL · November 6, 2025

Open Access · 3 Reviews

TL;DR

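This paper shows that stack-augmented neural networks can generalize hierarchically, in a human-like way, without special training conditions such as syntactic supervision, pre-training on massive corpora, or training long past convergence, with implications for modeling language acquisition.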

Contribution

It demonstrates, for the first time, that stack-augmented neural networks can generalize hierarchically without special training conditions, testing three base architectures (transformer, simple RNN, LSTM) augmented with two styles of stack, including novel stack modifications.

Findings

01

Transformers with nondeterministic stacks outperform other architectures.

02

Stack-augmented RNNs show improved hierarchical generalization.

03

Results suggest these models better mimic human language learning.

Abstract

Any finite set of training data is consistent with an infinite number of hypothetical algorithms that could have generated it. Studies have shown that when human children learn language, they consistently favor hypotheses based on hierarchical syntactic rules without ever encountering disambiguating examples. A recent line of work has inquired as to whether common neural network architectures share this bias, finding that they do so only under special conditions: when syntactically supervised, when pre-trained on massive corpora, or when trained long past convergence. In this paper, we demonstrate, for the first time, neural network architectures that are able to generalize in human-like fashion without any of the aforementioned requirements: stack-augmented neural networks. We test three base architectures (transformer, simple RNN, LSTM) augmented with two styles of stack: the…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The research question, the hypothesis, and the takeaways are clear and straightforward to understand, partly because this problem, as operationalized in terms of the specific datasets used, has a well-established cottage industry (again, not in a negative sense at all, just for lack of a better term).
- The exposition of the scope of the contribution and the experimental setup is generally clear.

Weaknesses

- I think "syntactically supervised" is a bit of a vague term for addressing what McCoy et al. 2020 did - my inference is that this refers to the tree-structured network experiments, but highlighting the fact that training of such networks requires explicit representations of syntactic structure in the training data. Since "syntactically supervised" is ambiguous between only data-level supervision and the architecture requiring parsed inputs, it might be expositionally clearer and make the new cont

Reviewer 02 · Rating 2 · Confidence 3

Strengths

This work rigorously sweeps over architectural parameterizations. The presentation of the existing methods is thorough.

Weaknesses

The primary weakness of this work is that the empirical contribution is relatively small. The majority of the text is dedicated to explaining architectural details already described in related work, or to describing fairly small architectural innovations that do not yield empirical improvements (i.e., the +R reading shortcut still results in a negative LR). Many of these descriptions could be moved to an appendix, with the main text expanded to include additional analyses and experiments. For exam

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper tackles an issue at the intersection of machine learning and linguistic theory — whether explicit structural memory can induce human-like syntactic generalization under “poverty of the stimulus.” The motivation is articulated clearly and grounded in prior psycholinguistic work. The work effectively connects formal-language theory, ML architectures, and cognitive modeling. This synthesis gives the study conceptual depth beyond architecture comparisons. The narrative is well structured:

Weaknesses

While the paper presents an elegant and well-executed study linking stack-augmented architectures with hierarchical generalization, several limitations remain. First, the experimental scope is narrow: all results rely on small synthetic grammars and two tasks within the poverty-of-stimulus framework, leaving it unclear whether the observed effects generalize to naturalistic or cross-linguistic data. The paper lacks discussion of other diagnostic cases traditionally used to probe structure depend

Taxonomy

Topics: Language Development and Disorders · Neurobiology of Language and Bilingualism · Topic Modeling