Augmenting Transformers with Recursively Composed Multi-grained Representations
Xiang Hu, Qingyang Zhu, Kewei Tu, Wei Wu

TL;DR
ReCAT is a recursive composition augmented Transformer that explicitly models the hierarchical syntactic structure of raw text through a novel contextual inside-outside (CIO) layer. The layer enables deep intra-span and inter-span interactions, improves performance on span-level tasks, and yields interpretable syntactic structures.
Contribution
The paper introduces ReCAT, a recursive Transformer with CIO layers for explicit hierarchical structure modeling, enhancing performance and interpretability without relying on gold trees.
Findings
ReCAT significantly outperforms vanilla Transformers on span-level tasks.
ReCAT's hierarchical structures align well with human-annotated syntactic trees.
CIO layers enable deep intra-span and inter-span interactions.
Abstract
We present ReCAT, a recursive composition augmented Transformer that is able to explicitly model hierarchical syntactic structures of raw texts without relying on gold trees during both learning and inference. Existing research along this line restricts data to follow a hierarchical tree structure and thus lacks inter-span communications. To overcome the problem, we propose a novel contextual inside-outside (CIO) layer that learns contextualized representations of spans through bottom-up and top-down passes, where a bottom-up pass forms representations of high-level spans by composing low-level spans, while a top-down pass combines information inside and outside a span. By stacking several CIO layers between the embedding layer and the attention layers in Transformer, the ReCAT model can perform both deep intra-span and deep inter-span interactions, and thus generate multi-grained…
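The two passes described in the abstract can be illustrated with a toy chart computation. The following is a minimal sketch, not the paper's CIO layer: `compose` and `score` are hypothetical stand-ins (an element-wise average and a dot product) for learned networks, the uniform averaging over contexts in the top-down pass is an assumption, and every function name here is illustrative only.

```python
import math

def compose(left, right):
    # Hypothetical composition function: element-wise average.
    # A real model would use a learned network here.
    return [(a + b) / 2.0 for a, b in zip(left, right)]

def score(left, right):
    # Hypothetical split score: dot product of the two child vectors.
    return sum(a * b for a, b in zip(left, right))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def bottom_up(tokens):
    # Inside pass: build the vector of span (i, j) from its sub-spans,
    # weighting every split point k by a softmax over split scores.
    n = len(tokens)
    inside = {(i, i): list(tokens[i]) for i in range(n)}
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            pairs = [(inside[(i, k)], inside[(k + 1, j)]) for k in range(i, j)]
            candidates = [compose(l, r) for l, r in pairs]
            weights = softmax([score(l, r) for l, r in pairs])
            inside[(i, j)] = weighted_sum(weights, candidates)
    return inside

def top_down(inside, n):
    # Outside pass: the context of span (i, j) combines each possible
    # parent's outside vector with the sibling's inside vector.
    dim = len(inside[(0, 0)])
    outside = {(0, n - 1): [0.0] * dim}  # the root span has no outside context
    for width in range(n - 2, -1, -1):
        for i in range(n - width):
            j = i + width
            contexts = []
            for m in range(j + 1, n):  # (i, j) is the left child of (i, m)
                contexts.append(compose(outside[(i, m)], inside[(j + 1, m)]))
            for m in range(0, i):      # (i, j) is the right child of (m, j)
                contexts.append(compose(outside[(m, j)], inside[(m, i - 1)]))
            uniform = [1.0 / len(contexts)] * len(contexts)  # assumption: uniform weights
            outside[(i, j)] = weighted_sum(uniform, contexts)
    return outside

def cio_sketch(tokens):
    # Contextualized span vectors: merge inside and outside information.
    inside = bottom_up(tokens)
    outside = top_down(inside, len(tokens))
    return {span: compose(inside[span], outside[span]) for span in inside}
```

Calling `cio_sketch` on a list of equal-length token vectors returns one vector per span `(i, j)`, so a sentence of n tokens yields n(n+1)/2 span representations at multiple granularities. In ReCAT these multi-grained representations are then passed to the Transformer attention layers; that step is not shown here.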
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Byte Pair Encoding
