The Efficiency Gap in Byte Modeling

Celine Lee; Jing Nathan Yan; Chen Liang; Jiaxin Shi; Yin Zhang; Jeremiah Liu; Pengcheng Yin; Fernando Pereira; Ed Chi; Derek Cheng; Alexander M. Rush; Ruoxi Wang

arXiv:2605.12928·cs.LG·May 14, 2026

TL;DR

This paper examines the computational cost of combining byte-level modeling with masked diffusion modeling (MDM) in language models. A compute-matched scaling study shows that byte modeling incurs a larger scaling overhead for MDM than for autoregressive (AR) models, a gap the authors link to context fragility.

Contribution

It provides a compute-matched scaling study of the cost of byte-level modeling under both autoregressive (AR) and masked diffusion (MDM) training, highlighting the importance of structural biases for efficient scaling.

Findings

01. Byte modeling has a worse scaling overhead for MDM than for AR models.

02. Context fragility affects byte modeling efficiency in diffusion models.

03. Structural biases are crucial for viable scaling in byte-based, modality-agnostic models.

Abstract

Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scale, the scaling overhead of byte modeling is worse for MDM than for AR. We hypothesize that this disparity stems from…
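To make the two design axes in the abstract concrete, here is a minimal illustrative sketch, not the paper's training setup. It assumes byte-level modeling means treating each UTF-8 byte as a token, AR ordering means predicting each token from its prefix, and MDM-style training means recovering a randomly masked subset of positions. The names `byte_tokens`, `ar_pairs`, `mdm_corrupt`, and `MASK_ID` are hypothetical and chosen only for this example.

```python
import random

MASK_ID = 256  # hypothetical mask id, chosen to sit outside the byte range 0..255


def byte_tokens(text: str) -> list[int]:
    """Byte-level 'tokenization': every UTF-8 byte becomes one token id."""
    return list(text.encode("utf-8"))


def ar_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """Autoregressive ordering: predict token t from the prefix tokens[:t]."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


def mdm_corrupt(tokens: list[int], mask_ratio: float, rng: random.Random):
    """Masked-diffusion-style corruption: hide a random subset of positions.

    Returns the corrupted sequence and a dict {position: original token}
    holding the values a model would be asked to recover.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK_ID if i in masked else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked)}
    return corrupted, targets


text = "byte models"
tokens = byte_tokens(text)
pairs = ar_pairs(tokens)
corrupted, targets = mdm_corrupt(tokens, mask_ratio=0.5, rng=random.Random(0))

# Two whitespace-separated words expand into many more byte positions,
# which is one source of the sequence-length cost of byte-level modeling.
print(len(text.split()), "words ->", len(tokens), "byte tokens")
print("AR training pairs:", len(pairs))
print("MDM masked positions:", sorted(targets))
```

In the AR view every target has its full left context intact, while in the masked view the surviving context around a hidden byte may itself be fragmented. This is only meant to illustrate the setting the abstract describes; the paper's own explanation of the disparity is cut off in this listing.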
