Segatron: Segment-Aware Transformer for Language Modeling and Understanding
He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li

TL;DR
Segatron introduces a segment-aware positional encoding mechanism to Transformer models, enhancing their contextual understanding and performance across language modeling and NLP tasks.
Contribution
The paper proposes a novel segment-aware encoding method for Transformers, improving language model perplexity and NLP task performance over standard models.
Findings
Applying the segment-aware mechanism to Transformer-XL achieves 17.1 perplexity on WikiText-103.
SegaBERT outperforms vanilla BERT on multiple NLP tasks.
SegaBERT also outperforms RoBERTa in zero-shot sentence representation learning.
Abstract
Transformers are powerful for sequence modeling. Nearly all state-of-the-art language models and pre-trained language models are based on the Transformer architecture. However, it distinguishes sequential tokens only with the token position index. We hypothesize that better contextual representations can be generated from the Transformer with richer positional information. To verify this, we propose a segment-aware Transformer (Segatron), by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token. We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model with memory extension and relative position encoding. We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset. We further…
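The combined position encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that sentence indices restart in every paragraph, that token indices restart in every sentence, and that the three encodings are combined by summing lookup-table rows. The function names (`segment_positions`, `make_table`, `combined_position_encoding`) and the toy document are hypothetical, and the tables hold random values where a real model would hold learned ones.

```python
import random


def segment_positions(document):
    """Return one (paragraph, sentence, token) index triple per token.

    `document` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of token strings.
    Assumption: sentence indices restart in every paragraph and token
    indices restart in every sentence.
    """
    triples = []
    for p_idx, paragraph in enumerate(document):
        for s_idx, sentence in enumerate(paragraph):
            for t_idx, _token in enumerate(sentence):
                triples.append((p_idx, s_idx, t_idx))
    return triples


def make_table(num_rows, dim, rng):
    """Build a toy embedding table (random stand-in for learned weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def combined_position_encoding(triples, para_table, sent_table, tok_table):
    """Sum the paragraph, sentence, and token rows for each position triple.

    Assumption: the three encodings are combined by element-wise addition.
    """
    encodings = []
    for p_idx, s_idx, t_idx in triples:
        rows = zip(para_table[p_idx], sent_table[s_idx], tok_table[t_idx])
        encodings.append([p + s + t for p, s, t in rows])
    return encodings


# Toy document: two paragraphs, three sentences, thirteen tokens.
document = [
    [["Transformers", "are", "powerful", "."], ["They", "use", "attention", "."]],
    [["Segatron", "adds", "segment", "positions", "."]],
]

triples = segment_positions(document)
rng = random.Random(0)
dim = 4
para_table = make_table(2, dim, rng)  # up to 2 paragraphs
sent_table = make_table(2, dim, rng)  # up to 2 sentences per paragraph
tok_table = make_table(5, dim, rng)   # up to 5 tokens per sentence
encodings = combined_position_encoding(triples, para_table, sent_table, tok_table)

for token_triple in triples[:5]:
    print(token_triple)
```

With a single token position index, the first token of the second sentence would simply be position 4. Here it receives the triple `(0, 1, 0)`, so the model can see both that a new sentence has started and that the token is still in the first paragraph.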
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Adaptive Input Representations · Linear Warmup With Cosine Annealing · Adaptive Softmax · Variational Dropout · Transformer-XL · RoBERTa
