Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages

Shunjie Wang; Shane Steinert-Threlkeld

arXiv:2309.00857 · cs.CL · October 20, 2023

TL;DR

This paper tests how well Transformers learn mildly context-sensitive languages. The models generalize well to unseen in-distribution data but extrapolate to longer strings worse than LSTMs do, and their learned self-attention patterns model dependency relations and show counting behavior.

Contribution

The study provides empirical evidence on the Transformer's capabilities and limitations in modeling mildly context-sensitive languages, highlighting its strengths and weaknesses relative to LSTMs.

Findings

01

Transformers generalize well to unseen in-distribution data.

02

Transformers extrapolate to strings longer than those seen in training worse than LSTMs do.

03

Learned self-attention patterns and representations model dependency relations and exhibit counting behavior.

Abstract

Despite the fact that Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their implications in modeling natural language, which is hypothesized to be mildly context-sensitive. We test the Transformer's ability to learn mildly context-sensitive languages of varying complexities, and find that they generalize well to unseen in-distribution data, but their ability to extrapolate to longer strings is worse than that of LSTMs. Our analyses show that the learned self-attention patterns and representations modeled dependency relations and demonstrated counting behavior, which may have helped the models solve the languages.
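
The abstract contrasts two evaluation settings: unseen in-distribution strings and strings longer than any seen in training. A minimal sketch of how such a split might be built is given here. This page does not list the languages or split sizes used in the paper, so the language a^n b^n c^n (a textbook example beyond context-free), the function names, and all parameters below are illustrative assumptions, not the authors' setup.

```python
import random


def anbncn(n: int) -> str:
    """Return the string a^n b^n c^n (assumed example language)."""
    return "a" * n + "b" * n + "c" * n


def is_anbncn(s: str) -> bool:
    """Check membership in {a^n b^n c^n : n >= 1}."""
    n, remainder = divmod(len(s), 3)
    return remainder == 0 and n >= 1 and s == anbncn(n)


def length_split(max_train_n: int, max_test_n: int,
                 train_fraction: float = 0.8, seed: int = 0):
    """Hypothetical split into train, in-distribution test, and longer test.

    Strings with n <= max_train_n are shuffled and divided between the
    training set and the in-distribution test set. Strings with
    max_train_n < n <= max_test_n form the extrapolation test set.
    """
    rng = random.Random(seed)
    short = list(range(1, max_train_n + 1))
    rng.shuffle(short)
    cut = int(len(short) * train_fraction)
    train = [anbncn(n) for n in short[:cut]]
    in_dist = [anbncn(n) for n in short[cut:]]
    longer = [anbncn(n) for n in range(max_train_n + 1, max_test_n + 1)]
    return train, in_dist, longer


if __name__ == "__main__":
    train, in_dist, longer = length_split(max_train_n=10, max_test_n=15)
    print(len(train), len(in_dist), len(longer))
```

The in-distribution test strings are unseen but fall in the same length range as the training strings, while every string in the longer set exceeds the maximum training length, which is the distinction the abstract's two findings rest on.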

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems