Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
Sukjun Hwang, Brandon Wang, Albert Gu

TL;DR
This paper introduces a dynamic chunking mechanism within hierarchical models that learns content-dependent segmentation, enabling fully end-to-end language modeling that outperforms traditional token-based models, especially on data where tokenization heuristics work poorly, such as Chinese, code, and DNA sequences.
Contribution
The paper presents a novel hierarchical network (H-Net) with dynamic chunking that replaces tokenization, enabling end-to-end training and improved performance across languages and modalities.
Findings
A byte-level H-Net with one hierarchy stage outperforms BPE-based Transformers when matched for compute and data.
Multiple hierarchy stages improve abstraction and scaling.
Significant data efficiency gains in Chinese, code, and DNA sequences.
Abstract
Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures (e.g. the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a…
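To make the idea of learned, content-dependent segmentation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract excerpt gives no formulas, so the boundary score (one half of one minus the cosine similarity between adjacent vectors), the 0.5 threshold, the exponential-moving-average smoothing step, and all function names (`boundary_probs`, `chunk`, `smooth`, `upsample`) are hypothetical choices loosely modeled on the routing and smoothing ideas mentioned in the reviews.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def boundary_probs(xs):
    """Assumed routing rule: a position is likely a chunk boundary when its
    vector differs from the previous one. The first position always scores 1."""
    probs = [1.0]
    for prev, cur in zip(xs, xs[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def chunk(xs, threshold=0.5):
    """Return indices of positions kept as chunk starts, plus all scores."""
    probs = boundary_probs(xs)
    keep = [t for t, p in enumerate(probs) if p >= threshold]
    return keep, probs


def smooth(chunk_vecs, chunk_probs):
    """Assumed smoothing rule: blend each chunk vector with the previous
    smoothed vector, weighted by that chunk's boundary probability."""
    out, prev = [], None
    for vec, p in zip(chunk_vecs, chunk_probs):
        if prev is None:
            cur = list(vec)
        else:
            cur = [p * a + (1.0 - p) * b for a, b in zip(vec, prev)]
        out.append(cur)
        prev = cur
    return out


def upsample(chunk_vecs, keep, length):
    """Repeat each chunk vector over the positions its chunk covers."""
    out, idx = [], -1
    for t in range(length):
        if idx + 1 < len(keep) and keep[idx + 1] == t:
            idx += 1
        out.append(chunk_vecs[idx])
    return out


# Toy byte-level embeddings: the direction flips at positions 2 and 4.
xs = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
keep, probs = chunk(xs)                      # keep == [0, 2, 4]
compressed = [xs[t] for t in keep]           # shorter sequence for the inner model
smoothed = smooth(compressed, [probs[t] for t in keep])
restored = upsample(smoothed, keep, len(xs))  # back to the original length
```

In this sketch the compressed sequence stands in for what an inner network would process at a coarser resolution; a trained model would score boundaries from learned projections of encoder outputs rather than from raw toy vectors.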
Peer Reviews
Decision: ICLR 2026 Poster
- The paper addresses an important limitation of many tokenizer-free architectures: training instability when boundary predictors must make discrete decisions (with or without supervision). Their proposed architecture is elegant in how it handles segmentation via the novel routing and smoothing mechanism.
- The paper is well written with lots of ablations, detailed discussions on different architectural and experimental choices that potentially aid reproducibility.
- Their experimental results
- Of course, this paper is not framed as a multilingual one, but the authors do claim improvements in other languages, and only evaluate on Chinese. While the improvements are notable, many recent frontier LMs are trained on web data across several languages. Do you have insights on how your architecture scales in a multilingual setting, when it is very common to have very distinct tokens mixed in individual sequences?
- In addition to what I mentioned above, I am particularly curious about the
- **Novelty and Significance:** The paper's primary strength lies in presenting a robust and scalable framework for fully end-to-end, learnable segmentation. The Dynamic Chunking mechanism, especially the smoothing module that turns a discrete selection problem into a differentiable one, is an elegant solution to a notoriously difficult problem that has hampered previous efforts. This work represents a significant step towards realizing the "bitter lesson" by replacing a major handcrafted heuris
- **Practical Efficiency Concerns:** The paper candidly acknowledges that the current implementation can be up to 2x slower during training and has dynamic memory usage, which can be unpredictable. This is a significant practical hurdle for widespread adoption and large-scale training, as it complicates hardware optimization and resource allocation.
- **Uncertainty and Potential Fragility at Extreme Scale:** While stability is demonstrated up to 1.3B parameters, the fact that larger scales are
- Interesting architecture that contributes to investigating the long-lasting issue of tokenisation in language models (aka tokenisation-free architectures)
- Many carefully conducted experiments, with positive results
- Paper clear and well written
- I think it is not ideal that the state of the art is only briefly described in the main part of the paper, and discussed at more length in an appendix. It makes the main part of the paper not really self-contained
- Moreover, the discussion on what is different and novel with H-Net with respect to previously published works is insufficient, especially in the main part of the paper. The authors write that "H-Nets […] unlock the ability to remove another layer of pre-processing, such as tokenize
Taxonomy
Topics: Speech Recognition and Synthesis · Neural Networks and Applications
