Fast Byte Latent Transformer

Julie Kallini; Artidoro Pagnoni; Tomasz Limisiewicz; Gargi Ghosh; Luke Zettlemoyer; Christopher Potts; Xiaochuang Han; Srinivasan Iyer

arXiv:2605.08044·cs.CL·May 11, 2026

TL;DR

This paper introduces new techniques for byte-level language models that significantly speed up generation while maintaining quality, making byte-level models more practical for real-world use.

Contribution

The paper presents BLT Diffusion (BLT-D), a fast parallel generation method for the Byte Latent Transformer (BLT), and two extensions, BLT Self-speculation (BLT-S) and BLT Diffusion+Verification (BLT-DV), that trade some of that speed for higher generation quality.

Findings

01. BLT-D reduces generation passes by generating multiple bytes in parallel.

02. Extensions like BLT-S and BLT-DV improve generation quality with minimal speed loss.

03. Memory-bandwidth cost can be reduced by over 50% with these methods.
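
To make the first finding concrete, the sketch below counts decoding passes for byte-by-byte generation versus generating a fixed-size block of bytes per step. It is only back-of-the-envelope arithmetic: the function name `decoding_passes`, the block size of 8, and the `passes_per_block` parameter are illustrative assumptions, not values or procedures taken from the paper.

```python
import math


def decoding_passes(num_bytes: int, block_size: int = 1, passes_per_block: int = 1) -> int:
    """Count decoder passes needed to emit num_bytes bytes.

    block_size and passes_per_block are hypothetical knobs for illustration;
    the paper's actual block size and refinement schedule are not assumed here.
    """
    if num_bytes < 0 or block_size < 1 or passes_per_block < 1:
        raise ValueError("num_bytes must be >= 0; block_size and passes_per_block must be >= 1")
    # Each block of up to block_size bytes costs passes_per_block passes.
    return math.ceil(num_bytes / block_size) * passes_per_block


# Byte-by-byte autoregressive decoding: one pass per byte.
sequential = decoding_passes(1024)

# Block-wise parallel decoding with an assumed block of 8 bytes per pass.
blockwise = decoding_passes(1024, block_size=8)

print(sequential, blockwise)
```

Under these assumptions the pass count shrinks by roughly the block size, and any extra refinement passes per block eat into that saving proportionally.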

Abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating…
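
The abstract describes the two extensions as inspired by speculative decoding. As a generic illustration of that idea (not the paper's BLT-S or BLT-DV procedure, whose details are not given here), the sketch below shows greedy draft-and-verify: a cheap drafter proposes several bytes, and a verifier keeps the longest prefix it agrees with, substituting its own byte at the first disagreement. The names `accept_draft` and `verifier_next` and the toy verifier are hypothetical.

```python
from typing import Callable, List


def accept_draft(
    context: List[int],
    draft: List[int],
    verifier_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy draft-and-verify for one step (generic sketch).

    Returns the drafted bytes the verifier agrees with, plus the verifier's
    own byte at the first mismatch. A real system would score all draft
    positions in one batched verifier pass; the loop here is for clarity only.
    """
    accepted: List[int] = []
    for proposed in draft:
        expected = verifier_next(context + accepted)
        if expected != proposed:
            accepted.append(expected)  # replace the rejected byte and stop
            break
        accepted.append(proposed)
    return accepted


# Toy verifier (assumption): always predicts the previous byte plus one.
def toy_verifier(seq: List[int]) -> int:
    return (seq[-1] + 1) % 256


partly_wrong = accept_draft([1, 2], [3, 4, 9, 10], toy_verifier)
all_correct = accept_draft([1, 2], [3, 4], toy_verifier)
print(partly_wrong, all_correct)
```

In this scheme the output always matches what the verifier alone would have produced greedily, so quality is set by the verifier while the drafter only affects how many bytes are accepted per step.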
