TL;DR
MrT5 introduces a dynamic token merging mechanism in byte-level language models, significantly reducing sequence length and inference time while largely maintaining performance across multilingual and downstream tasks.
Contribution
MrT5 proposes a token deletion mechanism that dynamically shortens input sequences in byte-level models, improving efficiency with minimal loss in accuracy.
Findings
Reduces sequence length by up to 75%.
Achieves faster inference with minimal performance loss.
Adapts to language-specific orthographic features in multilingual settings.
Abstract
Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level models like ByT5 attempt to address these concerns, they have not gained widespread adoption -- processing raw byte streams without tokenization results in significantly longer sequence lengths, making training and inference inefficient. This work introduces MrT5 (MergeT5), a more efficient variant of ByT5 that integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length. After processing through a fixed number of encoder layers, a learned delete gate determines which tokens are to be removed and which are to be retained for subsequent layers. MrT5 effectively "merges" critical information from deleted tokens…
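The gating step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `delete_gate` and `hard_delete`, the sigmoid-based gate, the scale of -30, and the threshold of -15 are all assumptions chosen for illustration. The sketch only shows the general idea of scoring each token after some encoder layers and passing the retained tokens on to later layers.

```python
import numpy as np

def delete_gate(hidden, w, b, scale=-30.0):
    """Hypothetical gate: one scalar per token in the range (scale, 0).

    Values near `scale` mark a token for deletion; values near 0 keep it.
    """
    logits = hidden @ w + b                    # shape: (seq_len,)
    return scale / (1.0 + np.exp(-logits))     # scale * sigmoid(logits)

def hard_delete(hidden, gate, threshold=-15.0):
    """Drop every token whose gate value falls at or below the (assumed) threshold."""
    keep = gate > threshold                    # boolean mask, shape: (seq_len,)
    return hidden[keep], keep

# Toy input standing in for the hidden states after a few encoder layers.
rng = np.random.default_rng(0)
seq_len, d_model = 8, 4
hidden = rng.normal(size=(seq_len, d_model))
w = rng.normal(size=d_model)                   # stand-in for learned gate weights
b = 0.0

gate = delete_gate(hidden, w, b)
shortened, keep = hard_delete(hidden, gate)
print(f"kept {shortened.shape[0]} of {seq_len} tokens")
```

Only `shortened` would be passed to the subsequent encoder layers in this sketch, which is where the savings in sequence length come from.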
Peer Reviews
Decision · ICLR 2025 Poster
A straightforward and lightweight addition to ByT5 models which can provide significant efficiency improvements (up to 80%) with a small trade-off in accuracy (~2%) compared to ByT5 (and often small improvements for English).
- Efficiency gains (though generally significant) can come at a slight performance cost for non-English languages, and it is not clear how much variance in this cost there may be.
- Without continued training in non-English scripts, there can be performance drops for non-English languages with smaller efficiency improvements. Figure 3 suggests that for Chinese, for example, there may be almost no reduction in seq length zero-shot for the span corruption task, and some languages have relatively large
* A simple yet effective method that enables models to dynamically learn how to delete tokens from byte-level inputs.
* Controlled experiments and results on downstream tasks support the authors' claims.
* MrT5 demonstrates competitive inference speed compared to ByT5.
* If I understand correctly, during training, the byte tokens are deleted softly, meaning there are still significant burdens for byte-level language models given that standard attention has quadratic time complexity, which limits their scalability to larger sizes.
* The experiments presented utilize moderate model sizes, which may constrain the overall persuasiveness of the proposed method.
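The reviewer's point about soft deletion can be made concrete with a small NumPy sketch. It assumes, for illustration only, that soft deletion is realised as an additive bias on the attention logits of each key; the function name `soft_deleted_attention` and the gate values are hypothetical. Under that assumption, softly deleted tokens receive almost no attention weight, yet the score matrix still has one entry per pair of tokens.

```python
import numpy as np

def soft_deleted_attention(q, k, v, gate):
    """Single-head attention with a per-key additive gate (assumed form of soft deletion)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])        # shape: (n, n), quadratic in n
    scores = scores + gate[None, :]                # large negative gate masks a key
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

# Hypothetical gate values: 0 keeps a token, -30 softly deletes it.
gate = np.array([0.0, -30.0, 0.0, -30.0, -30.0, 0.0])

out, weights = soft_deleted_attention(q, k, v, gate)
print(weights.shape)                               # still one row and column per token
```

By contrast, physically removing tokens shrinks the score matrix itself: a layer that keeps a quarter of the tokens computes one sixteenth of the score entries.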
• Clear and illustrative figure
• Very well-written and easy to read
• Demonstrates that the proposed method performs competitively with ByT5 on XNLI and Spelling Correction.
• The soft and hard deletion switch can be considered novel, at least for applying to this line of work.
• The baselines for MrT5 are not properly constructed (Section 5). To see how well MrT5 does, one should compare it with other orthogonal methods/models (e.g., pooling) that reduce sequence length and see how much increase in cross-entropy loss they incur.
• For downstream tasks we should compare with non-byte-level models to see the gap: how far are we in terms of accuracy? What's the run-time comparison after this token deletion optimization? These questions are left unanswered in the paper.
• J
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
