Trees in transformers: a theoretical analysis of the Transformer's ability to represent trees
Qi He, João Sedoc, Jordan Rodu

TL;DR
This paper analyzes theoretically whether Transformer networks can learn tree structures and backs the analysis with synthetic-data experiments, in which a standard Transformer reaches accuracy comparable to a Transformer given explicit tree encoding.
Contribution
The paper presents what the authors describe as the first theoretical analysis of the Transformer's ability to learn tree structures, and it tests that ability empirically in synthetic-data experiments.
Findings
Transformers can theoretically learn tree backbones.
Two linear layers with ReLU activation can recover any tree backbone from any two nonzero, linearly independent starting backbones.
Transformers achieve similar accuracy with or without explicit tree encoding.
Abstract
Transformer networks are the de facto standard architecture in natural language processing. To date, there are no theoretical analyses of the Transformer's ability to capture tree structures. We focus on the ability of Transformer networks to learn tree structures that are important for tree transduction problems. We first analyze the theoretical capability of the standard Transformer architecture to learn tree structures given enumeration of all possible tree backbones, which we define as trees without labels. We then prove that two linear layers with ReLU activation function can recover any tree backbone from any two nonzero, linearly independent starting backbones. This implies that a Transformer can learn tree structures well in theory. We conduct experiments with synthetic data and find that the standard Transformer achieves similar accuracy compared to a Transformer where tree…
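One way to build intuition for the two-layer claim is the small sketch below. It is not the paper's construction: the summary does not say how backbones are encoded, so the 0/1 vectors here are a hypothetical encoding, and the helper names (build_two_layer_map, forward) are invented for illustration. The sketch only shows that, given two linearly independent starting vectors, a linear layer can send them to two distinct one-hot codes, ReLU leaves those codes unchanged, and a second linear layer can then send each code to any chosen target vector.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(rows, v):
    return [dot(r, v) for r in rows]

def relu(v):
    return [max(0.0, x) for x in v]

def build_two_layer_map(a, b, target_a, target_b):
    """Return (W1, W2) such that W2 @ relu(W1 @ a) = target_a
    and W2 @ relu(W1 @ b) = target_b (hypothetical helper)."""
    # Gram matrix entries of the two starting vectors.
    gaa, gab, gbb = dot(a, a), dot(a, b), dot(b, b)
    det = gaa * gbb - gab * gab
    if abs(det) < 1e-12:
        raise ValueError("starting vectors must be linearly independent")
    # W1 is the left inverse of the matrix with columns a and b,
    # so W1 @ a = [1, 0] and W1 @ b = [0, 1].
    W1 = [
        [(gbb * ai - gab * bi) / det for ai, bi in zip(a, b)],
        [(gaa * bi - gab * ai) / det for ai, bi in zip(a, b)],
    ]
    # W2 has the two targets as its columns.
    W2 = [[ta, tb] for ta, tb in zip(target_a, target_b)]
    return W1, W2

def forward(W1, W2, x):
    # Linear layer, ReLU, linear layer.
    return matvec(W2, relu(matvec(W1, x)))

# Assumed 0/1 encodings of two starting backbones and two targets.
start_a = [1.0, 1.0, 0.0, 0.0]
start_b = [1.0, 0.0, 1.0, 0.0]
target_a = [0.0, 1.0, 1.0, 0.0]
target_b = [1.0, 1.0, 1.0, 1.0]

W1, W2 = build_two_layer_map(start_a, start_b, target_a, target_b)
out_a = forward(W1, W2, start_a)
out_b = forward(W1, W2, start_b)
```

Up to floating-point error, out_a and out_b match target_a and target_b. The linear-independence condition is what makes the first layer well defined: if the two starting vectors were parallel, no linear map could separate them into different one-hot codes.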
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections
