Trees in transformers: a theoretical analysis of the Transformer's ability to represent trees

Qi He; João Sedoc; Jordan Rodu

arXiv:2112.11913 · cs.CL · December 23, 2021 · 1 citation

TL;DR

This paper analyzes, in theory, the ability of Transformer networks to learn tree structures. Experiments on synthetic data support the analysis: a standard Transformer reaches accuracy similar to that of a Transformer with explicit tree encoding.

Contribution

The paper offers what its authors describe as the first theoretical analysis of the Transformer's ability to capture tree structures, and it checks that ability empirically with experiments on synthetic data.

Findings

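01 In theory, the standard Transformer architecture can learn tree backbones (trees without labels).

02 Two linear layers with a ReLU activation can recover any tree backbone from any two nonzero, linearly independent starting backbones.

03 On synthetic data, the standard Transformer achieves similar accuracy with or without explicit tree encoding.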

Abstract

Transformer networks are the de facto standard architecture in natural language processing. To date, there are no theoretical analyses of the Transformer's ability to capture tree structures. We focus on the ability of Transformer networks to learn tree structures that are important for tree transduction problems. We first analyze the theoretical capability of the standard Transformer architecture to learn tree structures given enumeration of all possible tree backbones, which we define as trees without labels. We then prove that two linear layers with ReLU activation function can recover any tree backbone from any two nonzero, linearly independent starting backbones. This implies that a Transformer can learn tree structures well in theory. We conduct experiments with synthetic data and find that the standard Transformer achieves similar accuracy compared to a Transformer where tree…
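
To make the notion of a tree backbone concrete, the sketch below strips the labels from a small labeled tree and enumerates every shape of a given size. It is an illustration only, not the paper's construction: the nested-tuple encoding, the restriction to ordered binary trees, and the names `backbone` and `binary_backbones` are all assumptions made here, and the paper's own definition and enumeration may differ.

```python
from functools import lru_cache


def backbone(tree):
    """Drop the labels of a tree given as (label, [children]).

    The result keeps only the shape: a leaf becomes (), and an internal
    node becomes a tuple holding the backbones of its children.
    (Hypothetical encoding, chosen for this illustration.)
    """
    _label, children = tree
    return tuple(backbone(child) for child in children)


@lru_cache(maxsize=None)
def binary_backbones(n_internal):
    """Enumerate all ordered binary tree shapes with n_internal internal nodes."""
    if n_internal == 0:
        return ((),)  # the only shape with no internal node is a single leaf
    shapes = []
    # Split the remaining internal nodes between the left and right subtrees.
    for left_size in range(n_internal):
        right_size = n_internal - 1 - left_size
        for left in binary_backbones(left_size):
            for right in binary_backbones(right_size):
                shapes.append((left, right))
    return tuple(shapes)


# A small labeled tree; the labels are made up for the example.
labeled = ("S", [("NP", []), ("VP", [("V", []), ("NP", [])])])
shape = backbone(labeled)

# The label-free shape is one of the enumerated backbones with 2 internal nodes.
found = shape in binary_backbones(2)

# Number of backbones for 0..4 internal nodes.
counts = [len(binary_backbones(n)) for n in range(5)]
```

Under these assumptions the number of backbones grows as the Catalan numbers, which gives a sense of how large the enumeration becomes as trees grow.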

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax · Adam · Position-Wise Feed-Forward Layer · Dense Connections