Causal Transformers Perform Below Chance on Recursive Nested Constructions, Unlike Humans
Yair Lakretz, Théo Desbordes, Dieuwke Hupkes, Stanislas Dehaene

TL;DR
This study evaluates how well state-of-the-art Transformer language models handle recursive nested constructions. The models excel on short-range embedded dependencies but, unlike humans, fall below chance on long-range ones.
Contribution
The paper tests four Transformer LMs on two types of nested constructions and shows that they handle short-range embedded dependencies almost perfectly but fall below chance on long-range ones, a limitation in recursive processing previously reported for RNN-LMs.
Findings
Transformers achieve near-perfect performance on short-range embedded dependencies.
Performance drops below chance level on long-range embedded dependencies.
Adding only three words to the embedded dependency is enough to cause this sharp decline.
Abstract
Recursive processing is considered a hallmark of human linguistic abilities. A recent study evaluated recursive processing in recurrent neural language models (RNN-LMs) and showed that such models perform below chance level on embedded dependencies within nested constructions -- a prototypical example of recursion in natural language. Here, we study if state-of-the-art Transformer LMs do any better. We test four different Transformer LMs on two different types of nested constructions, which differ in whether the embedded (inner) dependency is short or long range. We find that Transformers achieve near-perfect performance on short-range embedded dependencies, significantly better than previous results reported for RNN-LMs and humans. However, on long-range embedded dependencies, Transformers' performance sharply drops below chance level. Remarkably, the addition of only three words to…
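To make "below chance level" concrete, here is a minimal sketch of how accuracy on an agreement task of this kind could be scored. It is not the authors' evaluation code. The item format, the example sentences, and the `toy_logprob` scorer are all assumptions made for illustration. The toy scorer stands in for a language model and simply agrees the verb with the most recent noun, which is one way a systematic bias can push accuracy below the 50% expected from guessing.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (sentence prefix, correct verb, incorrect verb).
Item = Tuple[str, str, str]

# Made-up vocabulary for the toy scorer below.
SINGULAR_NOUNS = {"key", "man", "cabinet"}
PLURAL_NOUNS = {"keys", "men", "cabinets"}


def toy_logprob(prefix: str, verb: str) -> float:
    """Toy stand-in for a language model (an assumption, not a real LM).

    It prefers the verb whose number matches the most recent noun in the prefix.
    """
    last_noun_plural = None
    for word in prefix.lower().split():
        if word in PLURAL_NOUNS:
            last_noun_plural = True
        elif word in SINGULAR_NOUNS:
            last_noun_plural = False
    verb_plural = not verb.endswith("s")  # "hold" is plural, "holds" is singular
    return 0.0 if verb_plural == last_noun_plural else -5.0


def agreement_accuracy(
    items: Sequence[Item], logprob: Callable[[str, str], float]
) -> float:
    """Fraction of items where the correct verb scores higher than the incorrect one."""
    correct = sum(
        1 for prefix, good, bad in items if logprob(prefix, good) > logprob(prefix, bad)
    )
    return correct / len(items)


# Short-range embedded dependency: the embedded verb directly follows its subject.
SHORT_ITEMS = [
    ("The keys that the man", "holds", "hold"),
    ("The key that the men", "hold", "holds"),
]

# Long-range embedded dependency: three extra words separate subject and verb.
LONG_ITEMS = [
    ("The keys that the man near the cabinets", "holds", "hold"),
    ("The key that the men near the cabinet", "hold", "holds"),
]

short_accuracy = agreement_accuracy(SHORT_ITEMS, toy_logprob)
long_accuracy = agreement_accuracy(LONG_ITEMS, toy_logprob)
print(f"short-range: {short_accuracy:.2f}, long-range: {long_accuracy:.2f}")
```

With a real model, `toy_logprob` would be replaced by a function returning the model's log-probability of the verb given the prefix. Chance level is 0.5 because each item is a two-way choice.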
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Neurobiology of Language and Bilingualism
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Label Smoothing · Byte Pair Encoding
