ENTP: Encoder-only Next Token Prediction

Ethan Ewer; Daewon Chae; Thomas Zeng; Jinkyu Kim; Kangwook Lee

arXiv:2410.01600·cs.LG·February 5, 2025

Open Access

TL;DR

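This paper introduces Encoder-only Next Token Prediction (ENTP), which performs next-token prediction with an encoder-only Transformer rather than a decoder-only Transformer with causal attention. In settings where compute is not the limiting factor, ENTP shows advantages over decoder-only Transformers in both expressive power and empirical performance.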

Contribution

The paper presents ENTP, an encoder-only architecture for next-token prediction, highlighting its theoretical and empirical benefits over traditional decoder-only models.

Findings

01 ENTP performs the Count3 task easily, whereas a decoder-only Transformer cannot.

02 ENTP outperforms decoder-only Transformers on addition, in-context learning, and language modeling tasks.

03 Theoretical analysis shows that ENTP and decoder-only Transformers differ in expressive power, with potential advantages for ENTP.

Abstract

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. If we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $Count3$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language…
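To make the contrast in the abstract concrete, here is a minimal sketch of the two ways of producing next-token states. It is not the authors' implementation. It assumes a toy single-head attention layer with identity projections and a residual connection, and the names `attend`, `layer`, `decoder_states`, and `entp_states` are hypothetical, chosen only for this illustration. The decoder-style pass applies a causal mask once over the whole sequence. The encoder-style pass, as the abstract's framing suggests, reruns unmasked attention over each prefix and keeps the last position's state, which gives up the reuse of keys and values.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def layer(xs, causal):
    # Toy layer (assumption): identity Q/K/V projections plus a residual connection.
    out = []
    for i, x in enumerate(xs):
        visible = xs[: i + 1] if causal else xs  # causal mask vs. full attention
        ctx = attend(x, visible, visible)
        out.append([a + b for a, b in zip(x, ctx)])
    return out


def decoder_states(xs, num_layers=2):
    # Decoder-style: one causal pass; position i yields the state for prefix xs[:i+1].
    h = xs
    for _ in range(num_layers):
        h = layer(h, causal=True)
    return h


def entp_states(xs, num_layers=2):
    # Encoder-style: recompute unmasked attention from scratch for every prefix
    # and keep only the last position's state.
    states = []
    for n in range(1, len(xs) + 1):
        h = xs[:n]
        for _ in range(num_layers):
            h = layer(h, causal=False)
        states.append(h[-1])
    return states


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up embeddings
    print(decoder_states(tokens)[-1])
    print(entp_states(tokens)[-1])
```

In this toy setup the two functions agree when `num_layers=1`, because the last position sees the whole prefix either way. They diverge from two layers on, since in the encoder-style pass the intermediate states of earlier positions also depend on later tokens in the prefix. The per-prefix loop in `entp_states` is where the extra compute comes from.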

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Anomaly Detection Techniques and Applications

Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding