Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages
Shunjie Wang, Shane Steinert-Threlkeld

TL;DR
This paper tests how well Transformers learn mildly context-sensitive languages. The models generalize well to unseen in-distribution data but extrapolate to longer strings worse than LSTMs, and their learned self-attention patterns reflect dependency relations and counting behavior.
Contribution
The study provides empirical evidence of the Transformer's capabilities and limitations in learning mildly context-sensitive languages, highlighting its strengths and weaknesses relative to LSTMs.
Findings
Transformers generalize well to unseen in-distribution data.
Transformers extrapolate to strings longer than those seen in training worse than LSTMs do.
Learned self-attention patterns and representations model dependency relations and show counting behavior.
Abstract
Despite the fact that Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications in modeling natural language, which is hypothesized to be mildly context-sensitive. We test the Transformer's ability to learn mildly context-sensitive languages of varying complexities, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.
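To make the distinction between in-distribution generalization and length extrapolation concrete, the following is a minimal sketch of how such an evaluation split might be built. It assumes the language a^n b^n c^n, a standard textbook example of a mildly context-sensitive language; the summary above does not say which languages the authors used. The function names, length thresholds, and holdout fraction are all hypothetical and chosen only for illustration.

```python
import random


def multiple_agreement(n: int) -> str:
    """Return the string a^n b^n c^n (illustrative language, not from the paper)."""
    return "a" * n + "b" * n + "c" * n


def is_multiple_agreement(s: str) -> bool:
    """Check membership in {a^n b^n c^n : n >= 1}."""
    n, rem = divmod(len(s), 3)
    return rem == 0 and n >= 1 and s == multiple_agreement(n)


def make_splits(max_train_n=50, max_test_n=100, holdout_fraction=0.2, seed=0):
    """Build train, in-distribution test, and longer-string test sets.

    All thresholds are hypothetical. Values of n up to max_train_n are
    shuffled and split into train and in-distribution test; larger values
    of n form the length-extrapolation test set.
    """
    rng = random.Random(seed)
    short_ns = list(range(1, max_train_n + 1))
    rng.shuffle(short_ns)
    k = int(len(short_ns) * holdout_fraction)
    in_dist_ns = sorted(short_ns[:k])
    train_ns = sorted(short_ns[k:])
    longer_ns = range(max_train_n + 1, max_test_n + 1)
    return {
        "train": [multiple_agreement(n) for n in train_ns],
        "in_distribution_test": [multiple_agreement(n) for n in in_dist_ns],
        "longer_test": [multiple_agreement(n) for n in longer_ns],
    }


splits = make_splits()
```

Under this setup, a model that scores well on in_distribution_test but poorly on longer_test would show the pattern the abstract describes: good generalization within the training length range and weaker extrapolation beyond it.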
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
