The Double Helix inside the NLP Transformer
Jason H.J. Lu, Qingzhen Guo

TL;DR
This paper presents a framework for analyzing information types in NLP Transformers, revealing a helix-shaped positional information pattern and how different layers encode syntactic and semantic features.
Contribution
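It introduces an analysis method that distinguishes four layers of information (positional, syntactic, semantic, and contextual), proposes a Linear-and-Add approach as an alternative to adding positional information directly to the semantic embedding, and shows that the separated positional components follow a helix.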
Findings
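The positional components of the embedding vectors follow a helix through the deep layers, on both the encoder and the decoder side.
On the encoder side, the conceptual dimensions form Part-of-Speech (PoS) clusters.
On the decoder side, a di-gram (bigram) analysis reveals the PoS clusters of the next token.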
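To make the helix finding concrete, the following is a purely geometric sketch of a helix-shaped path over token positions: two coordinates rotate with position while a third advances linearly. The function name `helix_point` and the parameters `radius`, `omega`, and `pitch` are illustrative assumptions, not values or procedures from the paper, which extracts the positional components from trained embeddings rather than constructing them.

```python
import math

def helix_point(pos, radius=1.0, omega=0.5, pitch=0.1):
    """Return a 3-D point on a circular helix for a token position.

    All parameters are hypothetical and chosen only for illustration.
    """
    return (
        radius * math.cos(omega * pos),  # rotating component
        radius * math.sin(omega * pos),  # rotating component, 90 degrees out of phase
        pitch * pos,                     # component that advances linearly with position
    )

points = [helix_point(p) for p in range(20)]

# Projected onto the first two axes, every point lies on a circle of fixed radius.
radii = [math.hypot(x, y) for x, y, _ in points]

# Along the third axis, consecutive positions are separated by a constant step.
steps = [points[i + 1][2] - points[i][2] for i in range(len(points) - 1)]
```

A path with a constant projected radius and a constant axial step is what "helix" means here; whether the paper's distilled components share these exact proportions cannot be determined from this summary.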
Abstract
We introduce a framework for analyzing various types of information in an NLP Transformer. In this approach, we distinguish four layers of information: positional, syntactic, semantic, and contextual. We also argue that the common practice of adding positional information to semantic embedding is sub-optimal and propose instead a Linear-and-Add approach. Our analysis reveals an autogenetic separation of positional information through the deep layers. We show that the distilled positional components of the embedding vectors follow the path of a helix, both on the encoder side and on the decoder side. We additionally show that on the encoder side, the conceptual dimensions generate Part-of-Speech (PoS) clusters. On the decoder side, we show that a di-gram approach helps to reveal the PoS clusters of the next token. Our approach paves a way to elucidate the processing of information…
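The abstract does not spell out the Linear-and-Add computation. The sketch below assumes it means passing the token embedding through a learned linear layer before adding an absolute sinusoidal position encoding, and contrasts that with the common plain addition. The function names, the weight and bias arguments, and the choice of sinusoidal encoding are assumptions made for illustration, not the authors' implementation.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Absolute sinusoidal position encoding for one position."""
    enc = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

def linear(x, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def plain_add(token_emb, pos):
    """Common practice: add the position encoding directly to the token embedding."""
    pe = sinusoidal_encoding(pos, len(token_emb))
    return [e + p for e, p in zip(token_emb, pe)]

def linear_and_add(token_emb, pos, weight, bias):
    """Assumed reading of Linear-and-Add: project the embedding, then add position."""
    projected = linear(token_emb, weight, bias)
    pe = sinusoidal_encoding(pos, len(projected))
    return [e + p for e, p in zip(projected, pe)]

# Hypothetical 4-dimensional example; in a real model the weights would be learned.
emb = [0.1, 0.2, 0.3, 0.4]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
zero_bias = [0.0] * 4
out = linear_and_add(emb, pos=3, weight=identity, bias=zero_bias)
```

With an identity weight and zero bias the two functions coincide, so under this reading plain addition is a special case of Linear-and-Add; a trained linear layer is free to move semantic content into directions that overlap less with the positional signal.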
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Language and cultural evolution
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Adam · Byte Pair Encoding
