Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations
Hui Shi, Sicun Gao, Yuandong Tian, Xinyun Chen, Jishen Zhao

TL;DR
This paper compares LSTM and Transformer models in learning context-free grammars, revealing that Transformers better capture stack operations due to their latent space decomposition, explaining their superior performance.
Contribution
It introduces an oracle training paradigm to analyze how LSTM and Transformer decompose latent spaces and their ability to simulate stack operations in CFL learning.
Findings
Transformers outperform LSTMs in representing stack operations without forced decomposition.
Forced decomposition aligns LSTM and Transformer performance in CFL learning.
Transformers' latent space better captures stack-based automaton transitions.
Abstract
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks. Theoretical results show that both are Turing-complete and can represent any context-free language (CFL). In practice, it is often observed that Transformer models have better representation power than LSTM. But the reason is barely understood. We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns. To achieve this goal, we introduce an oracle training paradigm, which forces the decomposition of the latent representation of LSTM and the Transformer and supervises with the transitions of the Pushdown Automaton (PDA) of the corresponding CFL. With the forced decomposition, we show that the performance upper bounds of LSTM and Transformer in learning CFL are close: both of them…
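To make the idea of "supervising with the transitions of the PDA" concrete, here is a minimal sketch of how per-token stack labels could be produced for a depth-bounded bracket language. It is an illustration under stated assumptions, not the paper's implementation: the bracket alphabet `PAIRS`, the function name `pda_trace`, and the choice of recording the full stack after each step are all hypothetical, since the abstract does not specify the exact supervision targets.

```python
from typing import List, Optional, Tuple

# Hypothetical bracket alphabet for a Dyck-2 language (an assumption, not from the paper).
PAIRS = {"(": ")", "[": "]"}
CLOSERS = {close: open_ for open_, close in PAIRS.items()}

Step = Tuple[str, Tuple[str, ...]]  # (stack operation, stack contents after the step)


def pda_trace(tokens: List[str], max_depth: int) -> Optional[List[Step]]:
    """Run a depth-bounded pushdown automaton over `tokens`.

    Returns one (operation, stack_after) pair per token, or None if the
    string is rejected: unknown token, mismatched bracket, stack deeper
    than `max_depth`, or a non-empty stack at the end of the input.
    """
    stack: List[str] = []
    trace: List[Step] = []
    for tok in tokens:
        if tok in PAIRS:
            if len(stack) == max_depth:
                return None  # the bounded stack would overflow
            stack.append(tok)
            trace.append(("push", tuple(stack)))
        elif tok in CLOSERS:
            if not stack or stack[-1] != CLOSERS[tok]:
                return None  # closing bracket does not match the stack top
            stack.pop()
            trace.append(("pop", tuple(stack)))
        else:
            return None  # token outside the alphabet
    return trace if not stack else None


if __name__ == "__main__":
    for step in pda_trace(list("([])"), max_depth=2) or []:
        print(step)
```

In an oracle-style setup, each `(operation, stack_after)` pair could serve as an auxiliary target for the part of the latent representation that is meant to encode the stack, alongside the usual next-token objective.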
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Tanh Activation · Adam · Position-Wise Feed-Forward Layer
