ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer
Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li

TL;DR
ByteFlow Net is a tokenizer-free language modeling architecture built on adaptive byte compression: the model learns its own segmentation of raw byte streams and improves on traditional tokenization-based methods.
Contribution
It proposes a novel hierarchical model that removes the need for fixed tokenizers by learning segmentation based on compression, enhancing adaptability and effectiveness.
Findings
Outperforms BPE-based Transformers and previous byte-level models.
Demonstrates improved language modeling performance.
Shows that adaptive, compression-driven segmentation benefits downstream tasks.
Abstract
Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries while preserving a static computation graph via Top-K selection. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments…
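The segmentation idea in the abstract can be illustrated with a minimal sketch; this is not the authors' implementation. It assumes the log-det coding rate from the rate-reduction literature, R(Z) = 1/2 log det(I + d/(n eps^2) Z^T Z), scores each position by how much it raises the coding rate of a short preceding window, and keeps a fixed number k of boundaries so the output shape does not depend on the input. The function names, the window size, and the eps value are hypothetical choices made for illustration.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Log-det coding rate of n row vectors of dimension d (assumed form)."""
    n, d = Z.shape
    gram = Z.T @ Z  # (d, d) second-moment matrix
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * logdet


def topk_boundaries(H, k, window=4, eps=0.5):
    """Pick k boundary positions from latents H of shape (T, d).

    Each position is scored by the increase in coding rate it causes when
    appended to its preceding local window (a hypothetical scoring rule).
    """
    T = H.shape[0]
    scores = np.zeros(T)
    for t in range(T):
        lo = max(0, t - window)
        ctx = H[lo:t]           # preceding window, may be empty at t = 0
        with_t = H[lo:t + 1]    # same window including position t
        prev = coding_rate(ctx, eps) if len(ctx) > 0 else 0.0
        scores[t] = coding_rate(with_t, eps) - prev
    # Fixed budget: always exactly k boundaries, returned in sequence order.
    idx = np.sort(np.argsort(-scores, kind="stable")[:k])
    return idx, scores


rng = np.random.default_rng(0)
H = rng.normal(size=(16, 8))    # stand-in for byte-level latent states
idx, scores = topk_boundaries(H, k=4)
print(idx)
```

Because the top-k is taken over the whole sequence, the selected set at one position depends on scores at later positions, which is the non-causality concern raised in the reviews.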
Peer Reviews
Decision·ICLR 2026 Poster
Intuition for their work is solid. Strong ablations.
I would like to see a more in-depth discussion of the computational overhead and how it could be reduced.
1. The paper tackles a real bottleneck: static tokenization is non-learnable, language-specific, and fixes the granularity. The proposed method uses coding rate as the segmentation signal, which seems a principled alternative. The connection to rate–distortion is neat.
2. The hierarchical encoder–decoder is well-motivated and aligned with efficient modeling strategies, and applying SWA + Canon layers provides a realistic path for scaling byte-level models efficiently.
3. The empirical study sho
1. The paper does not report the actual cost of computing the coding-rate scores (the log-det-style term and its approximation). Since this has to run per sequence and per step, training-time overhead and distributed stability (e.g. with variable K, mixed-length batches) should be quantified.
2. Its practical advantage over existing byte-level or tokenizer-based architectures remains small at current scales (less than 1.3B). Demonstrating competitive performance at more than 7B parameters or on
- Principled method
- The paper is mostly well written and easy to read (except one part, see below)
- The method performs better than the baselines
- The method relies on top-k selection over the sequence, which leaks a small amount of information from the future, and it is unclear how to apply it in an autoregressive setting or how the coding-rate-based approach generalizes to such a setup. One option may be an auxiliary predictor, as in Mixture-of-Depths [1], or a learned threshold using a PI controller, as in [2]. The authors do not discuss this limitation anywhere in the paper.
- While the paper is mostly well written, eq. 15 is unc
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms
