ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

TL;DR
This paper introduces ByT5, a byte-level Transformer model that operates directly on raw text, offering robustness to noise and language independence, while maintaining competitive performance with token-based models.
Contribution
The paper demonstrates that standard Transformer architectures can be effectively adapted for byte-level processing, and releases new pre-trained byte-level models based on T5.
Findings
Byte-level models are more robust to noise.
Byte-level models perform well on spelling and pronunciation sensitive tasks.
Byte-level models are competitive with token-level models in efficiency.
Abstract
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with…
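To make the idea of operating directly on bytes concrete, here is a minimal sketch of a byte-level encoding in Python. It is an illustration, not the authors' code: the helper names `encode_bytes` and `decode_bytes` are hypothetical, and the offset of 3 assumes that the first three IDs are reserved for special tokens (padding, end of sequence, unknown). The loop at the end shows why byte sequences tend to be longer than character sequences for non-ASCII text.

```python
from typing import List

# Assumption of this sketch: IDs 0..2 are reserved for special tokens
# (pad=0, eos=1, unk=2), so raw byte values are shifted up by 3.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> List[int]:
    """Map text to one ID per UTF-8 byte, followed by an end-of-sequence ID."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping any ID that is not a shifted byte value."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + 256
    )
    # Incomplete multi-byte sequences are dropped rather than raising an error.
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # The same vocabulary of 256 byte values covers every language,
    # at the cost of longer sequences for non-ASCII scripts.
    for sample in ["hello", "h\u00e9llo", "\u3053\u3093\u306b\u3061\u306f"]:
        ids = encode_bytes(sample)
        print(f"{sample!r}: {len(sample)} characters -> {len(ids)} IDs")
```

No vocabulary file or language-specific preprocessing is involved in this sketch, which is the property the abstract points to when it says token-free models work on any language out of the box.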
Code & Models
- 🤗 google/byt5-small · model · 288k dl · ♡ 94
- 🤗 google/byt5-xl · model · 9.5k dl · ♡ 13
- 🤗 HeyLucasLeao/byt5-base-pt-product-reviews · model · 1 dl · ♡ 2
- 🤗 HeyLucasLeao/byt5-small-pt-product-reviews · model · 2 dl · ♡ 1
- 🤗 baffo32/pyc2py_alpha2 · model · 2 dl
- 🤗 google/byt5-base · model · 78k dl · ♡ 37
- 🤗 google/byt5-large · model · 36k dl · ♡ 18
- 🤗 google/byt5-xxl · model · 920 dl · ♡ 19
- 🤗 pierreguillou/byt5-small-qa-squad-v1.1-portuguese · model · 27 dl · ♡ 4
- 🤗 ufal/byt5-small-multilexnorm2021-da · model · 3 dl · ♡ 1
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adafactor · Label Smoothing · Inverse Square Root Schedule
