Word-Level Representation From Bytes For Language Modeling

Chu-Tak Lee; Qipeng Guo; Xipeng Qiu

arXiv:2211.12677 · cs.CL · November 24, 2022 · 1 citation

TL;DR

This paper introduces Byte2Word, a token-free language model that constructs word-level representations directly from bytes using cross-attention, reducing embedding size while maintaining performance.

Contribution

It proposes a byte-based word representation method built on cross-attention, which allows smaller embeddings and improves robustness to noise and cross-lingual transfer.

Findings

01 Byte2Word matches BERT's performance on language modeling and text classification.

02 It uses only 10% of the embedding size of traditional models.

03 The method is robust to synthetic noise and effective in cross-lingual transfer.

Abstract

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages: it is not robust to noise and is difficult to generalize to new languages. Also, the current trend of scaling up models reveals that larger models require larger embeddings, which makes parallelization hard. Previous work on image classification shows that splitting raw input into a sequence of chunks is a strong, model-agnostic inductive bias. Based on this observation, we rethink the existing character-aware method that takes character-level inputs but performs word-level sequence modeling and prediction. We overhaul this method by introducing a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level…
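To make the idea of building a word-level representation from bytes more concrete, the following is a minimal sketch of attention pooling over the UTF-8 bytes of each word. It is not the paper's architecture, which this page does not describe in detail. The embedding width `DIM`, the random `BYTE_EMBED` table, the single `QUERY` vector, and the helper names `word_vector` and `encode_sentence` are all assumptions made for illustration, and whitespace splitting stands in for whatever word segmentation the authors use.

```python
import math
import random

DIM = 8  # hypothetical embedding width, chosen only for illustration

rng = random.Random(0)
# One small vector per possible byte value: 256 rows instead of one row per sub-word.
# Random values stand in for trained parameters (assumption).
BYTE_EMBED = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(256)]
# A single query vector standing in for a learned query (assumption).
QUERY = [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_vector(word):
    """Pool the byte embeddings of one word into a single vector.

    The query attends over the word's bytes; the byte embeddings serve as
    both keys and values (no separate projections in this sketch).
    """
    keys = [BYTE_EMBED[b] for b in word.encode("utf-8")]
    scores = [
        sum(q * k for q, k in zip(QUERY, key)) / math.sqrt(DIM) for key in keys
    ]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(DIM)]


def encode_sentence(text):
    """Return one vector per whitespace-separated word."""
    return [word_vector(w) for w in text.split()]


vectors = encode_sentence("token free modelling")
print(len(vectors), len(vectors[0]))  # prints "3 8"
```

In this sketch the sequence handed to a downstream model has one position per word, while the only lookup table has 256 rows, which illustrates why a byte-based input layer can be much smaller than a sub-word embedding matrix.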

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Dense Connections · Residual Connection · Attention Dropout · Softmax · Dropout