Fast Byte Latent Transformer
Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

TL;DR
This paper introduces new techniques for byte-level language models that significantly speed up generation while maintaining quality, making byte-level models more practical for real-world use.
Contribution
The paper presents BLT Diffusion (BLT-D), a fast parallel generation method, and two extensions inspired by speculative decoding, BLT Self-speculation (BLT-S) and BLT Diffusion+Verification (BLT-DV), which trade some of that speed for higher generation quality in byte-level language models.
Findings
BLT-D reduces the number of forward passes needed to generate a sequence by producing multiple bytes in parallel per decoding step.
The two extensions, BLT-S and BLT-DV, improve generation quality with minimal speed loss.
Memory-bandwidth cost can be reduced by over 50% with these methods.
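To make the first finding concrete, here is a minimal arithmetic sketch of why emitting several bytes per decoding step lowers the number of forward passes. It is not the paper's implementation: the function name `forward_passes`, the sequence length of 1024 bytes, and the value of 8 bytes per step are all hypothetical, and the count is idealized (every step is assumed to commit a full group of bytes, with no rejected or re-generated bytes).

```python
import math


def forward_passes(num_bytes: int, bytes_per_step: int = 1) -> int:
    """Idealized number of decoding steps needed to emit `num_bytes` bytes.

    Assumption (not from the paper): each step commits exactly
    `bytes_per_step` bytes, except possibly a shorter final step.
    """
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    if bytes_per_step < 1:
        raise ValueError("bytes_per_step must be at least 1")
    return math.ceil(num_bytes / bytes_per_step)


# Byte-by-byte autoregressive generation: one pass per byte.
sequential = forward_passes(1024)

# Parallel generation with a hypothetical 8 bytes per decoding step.
parallel = forward_passes(1024, bytes_per_step=8)

print(sequential, parallel)
```

Under these assumptions the pass count falls in proportion to the number of bytes committed per step. Verification-based variants would fall somewhere between the two extremes, depending on how many drafted bytes are accepted.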
Abstract
Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating…
