ENTP: Encoder-only Next Token Prediction
Ethan Ewer, Daewon Chae, Thomas Zeng, Jinkyu Kim, Kangwook Lee

TL;DR
This paper introduces Encoder-only Next Token Prediction (ENTP), which performs next-token prediction with an encoder-only Transformer instead of a causal decoder. In settings where compute is not the limiting factor, ENTP shows advantages over decoder-only Transformers in both expressive power and empirical performance.
Contribution
The paper presents ENTP, an encoder-only architecture for next-token prediction, highlighting its theoretical and empirical benefits over traditional decoder-only models.
Findings
ENTP performs the Count3 task easily, whereas decoder-only Transformers cannot.
ENTP outperforms decoder-only Transformers on addition, in-context learning, and language modeling tasks.
Theoretical analysis shows ENTP's superior expressive power.
Abstract
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. What if we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the Count3 task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language…
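The contrast the abstract draws between causal decoding and encoder-only prediction can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a toy single-head attention stack with residual connections and no MLP, layer normalization, or positional encoding, and every name in it (`attention_layer`, `decoder_states`, `entp_states`, `make_layers`) is hypothetical. The decoder computes the state used to predict each next token in one causally masked pass, which is what makes key and value reuse possible. The encoder-only variant instead runs full bidirectional attention over each prefix and keeps only the last position, so it needs one forward pass per prefix.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_layer(h, Wq, Wk, Wv, causal):
    # Toy single-head self-attention with a residual connection.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Position i may only attend to positions j <= i.
        n = h.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return h + softmax(scores) @ v

def forward(h, layers, causal):
    for Wq, Wk, Wv in layers:
        h = attention_layer(h, Wq, Wk, Wv, causal)
    return h

def decoder_states(x, layers):
    # Decoder-only: one causally masked pass yields the state for every prefix.
    return forward(x, layers, causal=True)

def entp_states(x, layers):
    # Encoder-only (assumed reading of ENTP): rerun full attention on each
    # prefix x[:t+1] and keep only the final position's state.
    return np.stack([forward(x[: t + 1], layers, causal=False)[-1]
                     for t in range(len(x))])

def make_layers(num_layers, d, seed=0):
    # Hypothetical random weights, for illustration only.
    rng = np.random.default_rng(seed)
    return [tuple(rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
            for _ in range(num_layers)]

x = np.random.default_rng(1).normal(size=(5, 4))  # 5 tokens, width 4
one_layer, two_layers = make_layers(1, 4), make_layers(2, 4)
same_with_one_layer = np.allclose(decoder_states(x, one_layer),
                                  entp_states(x, one_layer))
same_with_two_layers = np.allclose(decoder_states(x, two_layers),
                                   entp_states(x, two_layers))
```

In this toy setup the two schemes coincide for a single layer, because the last position of a prefix sees the same keys and values either way. With two or more layers they generally diverge: in the encoder-only pass, the intermediate representations of earlier tokens are recomputed with access to every token in the current prefix, so they cannot be cached and reused across prefixes.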
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Anomaly Detection Techniques and Applications
Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
