ByT5: Towards a token-free future with pre-trained byte-to-byte models

Linting Xue; Aditya Barua; Noah Constant; Rami Al-Rfou; Sharan Narang; Mihir Kale; Adam Roberts; Colin Raffel

arXiv:2105.13626 · cs.CL · March 9, 2022 · 69 cites

TL;DR

This paper introduces ByT5, a byte-level Transformer model that operates directly on raw text, offering robustness to noise and language independence, while maintaining competitive performance with token-based models.

Contribution

The paper demonstrates that standard Transformer architectures can be effectively adapted for byte-level processing, and releases new pre-trained byte-level models based on T5.

Findings

01 Byte-level models are more robust to noise.

02 Byte-level models perform well on spelling- and pronunciation-sensitive tasks.

03 Byte-level models are competitive with token-level models in efficiency.

Abstract

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with…
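
To make the idea of operating directly on bytes concrete, here is a minimal sketch of a byte-level "tokenizer". It is not taken from the paper's released code: the function names `encode_bytes` and `decode_bytes` are hypothetical, and the choice to reserve three low IDs for special symbols is an assumption of this sketch. It illustrates two points from the abstract: any language can be encoded with no vocabulary file, and the resulting sequences are longer than the character count for non-ASCII text.

```python
from typing import List

# Assumption of this sketch: a few low IDs are reserved for special symbols
# (for example padding and end-of-sequence), so raw byte values are shifted up.
NUM_RESERVED_IDS = 3


def encode_bytes(text: str) -> List[int]:
    """Map text to integer IDs, one per UTF-8 byte (hypothetical helper)."""
    return [b + NUM_RESERVED_IDS for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping any reserved IDs (hypothetical helper)."""
    raw = bytes(i - NUM_RESERVED_IDS for i in ids if i >= NUM_RESERVED_IDS)
    # errors="ignore" drops byte runs that do not form valid UTF-8.
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # ASCII, accented Latin, and Japanese all go through the same code path.
    for sample in ["hello", "h\u00e9llo", "こんにちは"]:
        ids = encode_bytes(sample)
        print(f"{sample!r}: {len(sample)} characters -> {len(ids)} byte IDs")
```

The only "vocabulary" here is the 256 possible byte values plus the reserved IDs, which is why no language-specific preprocessing is needed. The cost is sequence length: each non-ASCII character expands to two or more IDs.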

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adafactor · Label Smoothing · Inverse Square Root Schedule