Word-Level Representation From Bytes For Language Modeling
Chu-Tak Lee, Qipeng Guo, Xipeng Qiu

TL;DR
This paper introduces Byte2Word, a token-free language model that constructs word-level representations directly from bytes using cross-attention, reducing embedding size while maintaining performance.
Contribution
It proposes a novel method that builds word representations from bytes with cross-attention, enabling smaller embeddings and improved robustness, especially to noise and in cross-lingual transfer.
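To make the idea of pooling bytes into a word vector concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: it assumes a single query vector attending over the byte embeddings of each whitespace-separated word, with no key/value projections, no multiple heads, and random rather than trained weights. The names `DIM`, `BYTE_EMB`, `QUERY`, `word_vector`, and `encode` are hypothetical and chosen only for illustration.

```python
import math
import random

DIM = 8  # hypothetical embedding width, chosen only for illustration

rng = random.Random(0)
# One embedding row per possible byte value (256 rows), randomly initialised here.
BYTE_EMB = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(256)]
# A single query vector standing in for a learned word-level query (assumption).
QUERY = [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_vector(word):
    """Pool the UTF-8 bytes of `word` into one vector with single-query attention."""
    keys = [BYTE_EMB[b] for b in word.encode("utf-8")]
    if not keys:
        raise ValueError("word must contain at least one byte")
    scale = math.sqrt(DIM)
    # Scaled dot-product score between the query and each byte embedding.
    scores = [sum(q * k for q, k in zip(QUERY, key)) / scale for key in keys]
    weights = softmax(scores)
    # The byte embeddings double as values (no separate value projection here).
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(DIM)]


def encode(text):
    """Split on whitespace and return one vector per word."""
    return [word_vector(w) for w in text.split()]
```

In this sketch the embedding table has one row per byte value, so it stays at 256 rows no matter how many distinct words or languages appear in the input, and a misspelled word still maps to a vector built from mostly the same bytes.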
Findings
Byte2Word matches BERT's performance on language modeling and text classification.
It uses only 10% of the embedding size of traditional models.
The method is robust to synthetic noise and effective in cross-lingual transfer.
Abstract
Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages, such as not being robust to noise and being difficult to generalize to new languages. Also, the current trend of scaling up models reveals that larger models require larger embeddings, which makes parallelization hard. Previous work on image classification proves that splitting raw input into a sequence of chunks is a strong, model-agnostic inductive bias. Based on this observation, we rethink the existing character-aware method that takes character-level inputs but performs word-level sequence modeling and prediction. We overhaul this method by introducing a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Dense Connections · Residual Connection · Attention Dropout · Softmax · Dropout
