Forming Trees with Treeformers
Nilay Patel, Jeffrey Flanigan

TL;DR
This paper introduces Treeformer, an encoder module inspired by the CKY algorithm that incorporates hierarchical structure into Transformers, significantly improving compositional generalization and performance on downstream NLP tasks.
Contribution
Treeformer is a novel encoder module that explicitly models hierarchical structure within Transformers, enhancing their ability to handle compositional language tasks.
Findings
Improves compositional generalization in NLP tasks
Enhances performance in machine translation and summarization
Demonstrates the benefits of hierarchical structure in Transformers
Abstract
Human language is known to exhibit a nested, hierarchical structure, allowing us to form complex sentences out of smaller pieces. However, many state-of-the-art neural network models such as Transformers have no explicit hierarchical structure in their architecture -- that is, they don't have an inductive bias toward hierarchical structure. Additionally, Transformers are known to perform poorly on compositional generalization tasks which require such structures. In this paper, we introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm which learns a composition operator and pooling function to construct hierarchical encodings for phrases and sentences. Our extensive experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer and show significant improvements in compositional generalization as well as in downstream tasks…
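To make the CKY-style idea in the abstract concrete, the following is a minimal sketch of filling a chart of span encodings bottom-up. It is not the authors' implementation: the names `compose`, `pool`, and `cky_encode` are hypothetical, and both operators are fixed placeholders (a tanh of the element-wise sum, and an element-wise mean), whereas the paper describes a learned composition operator and pooling function.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Chart = Dict[Tuple[int, int], Vector]


def compose(left: Vector, right: Vector) -> Vector:
    """Placeholder composition operator (assumption): tanh of the element-wise sum.

    In a trained model this would be a learned function of the two child encodings.
    """
    return [math.tanh(l + r) for l, r in zip(left, right)]


def pool(candidates: List[Vector]) -> Vector:
    """Placeholder pooling function (assumption): element-wise mean over candidates.

    Each candidate corresponds to one split point of the same span.
    """
    n = len(candidates)
    return [sum(values) / n for values in zip(*candidates)]


def cky_encode(tokens: List[Vector]) -> Chart:
    """Build encodings for every contiguous span, shortest spans first.

    chart[(i, j)] holds the encoding of tokens[i:j].
    """
    n = len(tokens)
    # Width-1 spans are the token encodings themselves.
    chart: Chart = {(i, i + 1): tokens[i] for i in range(n)}
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            # One candidate per split point k, combining the two sub-spans.
            candidates = [
                compose(chart[(start, k)], chart[(k, end)])
                for k in range(start + 1, end)
            ]
            chart[(start, end)] = pool(candidates)
    return chart


# Usage: three toy token vectors of dimension 2.
tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
chart = cky_encode(tokens)
sentence_encoding = chart[(0, len(tokens))]  # encoding of the full span
```

As in the CKY algorithm, every span is built from all of its binary splits, so a sentence of n tokens yields n(n+1)/2 span encodings, and the entry for the full span can serve as a sentence-level representation.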
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
