Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations

Hui Shi; Sicun Gao; Yuandong Tian; Xinyun Chen; Jishen Zhao

arXiv:2112.09174 · cs.CL · March 24, 2022 · 1 citation

TL;DR

This paper compares LSTM and Transformer models in learning context-free grammars, revealing that Transformers better capture stack operations due to their latent space decomposition, explaining their superior performance.

Contribution

It introduces an oracle training paradigm to analyze how LSTM and Transformer decompose latent spaces and their ability to simulate stack operations in CFL learning.

Findings

01 Transformers outperform LSTMs in representing stack operations without forced decomposition.

02 Forced decomposition aligns LSTM and Transformer performance in CFL learning.

03 Transformers' latent space better captures stack-based automaton transitions.

Abstract

Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks. Theoretical results show that both are Turing-complete and can represent any context-free language (CFL). In practice, Transformer models are often observed to have better representation power than LSTM, but the reason is poorly understood. We study these practical differences between LSTM and the Transformer and propose an explanation based on their latent space decomposition patterns. To this end, we introduce an oracle training paradigm, which forces the decomposition of the latent representation of LSTM and the Transformer and supervises it with the transitions of the Pushdown Automaton (PDA) of the corresponding CFL. With the forced decomposition, we show that the performance upper bounds of LSTM and Transformer in learning CFL are close: both of them…
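To make the idea of supervising with PDA transitions concrete, here is a minimal Python sketch of how per-token stack labels could be produced for a small bracket language. It is an illustration only, not the authors' implementation: the function name `pda_trace`, the two-bracket alphabet, the `_` padding symbol, and the reading of "bounded" as a maximum stack depth are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical bracket alphabet: each opening symbol maps to its closing symbol.
PAIRS = {"(": ")", "[": "]"}

Step = Tuple[str, Tuple[str, ...]]


def pda_trace(tokens: List[str], max_depth: int) -> List[Step]:
    """Run a deterministic PDA over a bracket string.

    Returns one (action, stack snapshot) pair per input token. The snapshot is
    padded with "_" to a fixed length of max_depth, so each stack slot could
    serve as a separate supervision target for one part of a latent vector.
    """
    stack: List[str] = []
    trace: List[Step] = []
    for tok in tokens:
        if tok in PAIRS:
            # Opening bracket: push, unless the assumed depth bound is hit.
            if len(stack) >= max_depth:
                raise ValueError("depth bound exceeded")
            stack.append(tok)
            action = "push"
        elif stack and PAIRS[stack[-1]] == tok:
            # Closing bracket that matches the top of the stack: pop.
            stack.pop()
            action = "pop"
        else:
            raise ValueError(f"unexpected token {tok!r}")
        snapshot = tuple(stack) + ("_",) * (max_depth - len(stack))
        trace.append((action, snapshot))
    return trace


if __name__ == "__main__":
    for action, snapshot in pda_trace(list("([])"), max_depth=3):
        print(action, snapshot)
```

In a setup like the one the abstract describes, labels of this kind would be the oracle signal: a model trained with forced decomposition would be asked to predict the stack contents and the operation at every step, rather than only the next token.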

Code & Models

Repositories

shihui2010/learn_cfg_with_neural_network (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Tanh Activation · Adam · Position-Wise Feed-Forward Layer