ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Chunyuan Deng; Sanket Lokegaonkar; Colin Lockard; Besnik Fetahu; Nasser Zalmout; Xian Li

arXiv:2603.03583·cs.CL·March 5, 2026

Open Access · 3 Reviews

TL;DR

ByteFlow Net introduces a tokenizer-free, adaptive byte compression architecture for language modeling, enabling models to learn their own segmentation of raw byte streams, leading to improved performance over traditional tokenization methods.

Contribution

It proposes a novel hierarchical model that removes the need for fixed tokenizers by learning segmentation based on compression, enhancing adaptability and effectiveness.

Findings

01. Outperforms BPE-based Transformers and previous byte-level models.
02. Demonstrates improved language modeling performance.
03. Shows that adaptive, compression-driven segmentation benefits downstream tasks.

Abstract

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments…
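The abstract's "adaptive boundaries with a static computation graph via Top-K selection" can be illustrated with a small sketch. It is not the authors' implementation: the per-byte `scores` are a hypothetical stand-in for the paper's coding-rate signal, the function names are invented for illustration, and forcing position 0 to be a boundary is an assumption of this sketch. The point is only that picking exactly K positions always yields K chunks, whatever the input looks like.

```python
from typing import List, Sequence


def select_boundaries(scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring positions, in left-to-right order.

    `scores` is a hypothetical per-byte boundary score, not the paper's
    actual coding-rate computation.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    # Rank positions by score; ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the winners so the chunks keep their original order.
    return sorted(ranked[:k])


def segment(data: bytes, scores: Sequence[float], k: int) -> List[bytes]:
    """Split `data` into exactly k chunks, each starting at a selected position."""
    if len(scores) != len(data):
        raise ValueError("need one score per byte")
    forced = list(scores)
    # Assumption of this sketch: position 0 is always a boundary,
    # so no leading bytes are dropped.
    forced[0] = float("inf")
    starts = select_boundaries(forced, k)
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    text = b"the cat sat"
    # Toy scores: pretend the signal peaks right after each space.
    toy_scores = [1.0 if i > 0 and text[i - 1:i] == b" " else 0.1
                  for i in range(len(text))]
    print(segment(text, toy_scores, k=3))  # prints [b'the ', b'cat ', b'sat']
```

Because the number of chunks is fixed by `k` and not by the content, downstream tensors would keep the same shape across inputs, which is the property the abstract attributes to Top-K selection.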

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The intuition for the work is solid. Strong ablations.

Weaknesses

I would like to see a more in-depth discussion of the computational overhead and how it could be reduced.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper tackles a real bottleneck: static tokenization is non-learnable, language-specific, and fixes the granularity. The proposed method uses coding rate as the segmentation signal, which seems a principled alternative. The connection to rate–distortion is neat.
2. The hierarchical encoder–decoder is well-motivated and aligned with efficient modeling strategies, and applying SWA + Canon layers provides a realistic path for scaling byte-level models efficiently.
3. The empirical study sho

Weaknesses

1. The paper does not report the actual cost of computing the coding-rate scores (the log-det–style term and its approximation). Since this has to run per sequence and per step, training-time overhead and distributed stability (e.g. with variable K, mixed-length batches) should be quantified.
2. Its practical advantage over existing byte-level or tokenizer-based architectures remains small at current scales (less than 1.3B). Demonstrating competitive performance at more than 7B parameters or on
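To make the cost concern about a "log-det–style term" concrete, here is a minimal sketch of the standard coding rate from the rate-reduction literature, R(Z, ε) = ½ log det(I + (d / (n ε²)) Z Zᵀ) for a d × n matrix Z of n feature vectors. Whether ByteFlow Net uses exactly this form, and which approximation it applies, is not stated in the text above, so treat the formula choice, the function names, and the default `eps` as assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def gram(rows: Matrix, scale: float) -> Matrix:
    """Return I + scale * Z Z^T, where `rows` holds the d rows of Z (length n each)."""
    d = len(rows)
    return [[(1.0 if i == j else 0.0)
             + scale * sum(a * b for a, b in zip(rows[i], rows[j]))
             for j in range(d)] for i in range(d)]


def logdet_spd(a: Matrix) -> float:
    """log det of a symmetric positive-definite matrix via Cholesky, O(d^3)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # det(A) = prod(diag(L))^2, so log det(A) = 2 * sum(log(diag(L))).
    return 2.0 * sum(math.log(low[i][i]) for i in range(d))


def coding_rate(z: Matrix, eps: float = 0.5) -> float:
    """Coding rate of the n columns of the d x n matrix `z` at distortion `eps`."""
    d, n = len(z), len(z[0])
    return 0.5 * logdet_spd(gram(z, d / (n * eps * eps)))


if __name__ == "__main__":
    # Two orthonormal features: I + Z Z^T = 2I, so the rate is ln 2 (about 0.693).
    print(coding_rate([[1.0, 0.0], [0.0, 1.0]], eps=1.0))
```

In this exact form, each evaluation costs O(d²n) for the Gram matrix plus O(d³) for the factorization, which is the kind of per-sequence, per-step overhead the reviewer asks the authors to quantify.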

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Principled method
- The paper is mostly well written and easy to read (except one part, see below)
- The method performs better than the baselines

Weaknesses

- The method relies on top-k over sequence, which (1) leaks minimal information from the future, (2) it is unclear how to apply it in an autoregressive setting. It is unclear how their code rate-based approach can be generalized to an autoregressive setup. Maybe with an auxiliary predictor, like in Mixture-of-Depths [1], or a learned threshold using a PI controller like [2]. The authors do not discuss this limitation anywhere in the paper.
- While the paper is mostly well written, eq. 15 is unc
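The reviewer's point about top-k over the sequence leaking information from the future can be shown with a toy example. The scores below are made up, and the fixed threshold is only the simplest causal rule one could try; it is not the auxiliary predictor or PI-controlled threshold the reviewer mentions, and nothing here is taken from the paper.

```python
from typing import List, Sequence


def topk_mask(scores: Sequence[float], k: int) -> List[bool]:
    """Mark the k highest-scoring positions (ties go to the earlier position)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(scores))]


def threshold_mask(scores: Sequence[float], tau: float) -> List[bool]:
    """Causal alternative: each decision looks only at its own score."""
    return [s > tau for s in scores]


prefix = [0.9, 0.2, 0.6]
quiet_future = prefix + [0.1, 0.1]
busy_future = prefix + [0.8, 0.7]

# Same prefix, different futures: the Top-K decision at position 2 flips.
print(topk_mask(quiet_future, k=2)[:3])  # [True, False, True]
print(topk_mask(busy_future, k=2)[:3])   # [True, False, False]

# The per-position threshold gives the same prefix decisions either way.
print(threshold_mask(quiet_future, tau=0.5)[:3])  # [True, False, True]
print(threshold_mask(busy_future, tau=0.5)[:3])   # [True, False, True]
```

The trade-off is visible in the same example: the threshold rule is causal but selects a variable number of positions (2 for the quiet future, 4 for the busy one), so it gives up the fixed K that keeps shapes static.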

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms