Character-Level Language Modeling with Deeper Self-Attention

Rami Al-Rfou; Dokook Choe; Noah Constant; Mandy Guo; Llion Jones

arXiv:1808.04444 · cs.CL · December 11, 2018

TL;DR

This paper shows that a deep 64-layer transformer with a fixed context outperforms RNN variants on character-level language modeling, reaching state-of-the-art results on text8 (1.13 bits per character) and enwik8 (1.06 bits per character). Auxiliary losses are what make training at this depth work.

Contribution

It introduces a 64-layer transformer for character-level language modeling, trained with auxiliary losses at intermediate network layers and intermediate sequence positions.

Findings

01

Deep transformer outperforms RNNs on benchmarks

02

Auxiliary losses improve training at depth

03

Achieves state-of-the-art results on text8 and enwik8

Abstract

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
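
The two ideas in the abstract, the bits-per-character metric and auxiliary losses at intermediate layers and positions, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the nested-list input layout, and the `aux_weight` value are assumptions made for illustration. It only shows the general shape of the objective, where every sequence position contributes a prediction loss and intermediate layers add their own weighted loss on top of the final layer's.

```python
import math


def bits_per_char(target_probs):
    """Mean negative log2 probability assigned to the correct characters.

    target_probs: probabilities the model gave to the true next character,
    one entry per sequence position.
    """
    return -sum(math.log2(p) for p in target_probs) / len(target_probs)


def combined_loss(probs_by_layer, aux_weight=0.5):
    """Final-layer loss plus weighted auxiliary losses (hypothetical weighting).

    probs_by_layer[layer][position] is the probability that the prediction
    made from that layer's output assigns to the true next character at that
    position. The last entry is the final layer; earlier entries are the
    intermediate layers that carry auxiliary losses.
    """
    *intermediate, final = probs_by_layer
    # Loss at every position of the final layer, not only the last position.
    loss = bits_per_char(final)
    # Each intermediate layer also predicts the targets; aux_weight is an assumption.
    for layer_probs in intermediate:
        loss += aux_weight * bits_per_char(layer_probs)
    return loss


if __name__ == "__main__":
    final_layer = [0.5, 0.5]     # 1 bit per character
    middle_layer = [0.25, 0.25]  # 2 bits per character
    print(bits_per_char(final_layer))
    print(combined_loss([middle_layer, final_layer], aux_weight=0.5))
```

At evaluation time only the final layer's bits per character would be reported; the auxiliary terms exist purely as a training signal in this sketch.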

Code & Models

Repositories

facebookresearch/code-prediction-transformer
pytorch

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax