Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality

Gustavo Aguilar; Bryan McCann; Tong Niu; Nazneen Rajani; Nitish Keskar; Thamar Solorio

arXiv:2010.12730 · cs.CL · September 27, 2021

TL;DR

This paper introduces char2subword, a character-based module that enhances subword embeddings in language models like BERT, improving robustness to misspellings and inflections without retraining the entire model.

Contribution

The proposed char2subword module is a drop-in replacement for subword embeddings, enabling better handling of character-level variations in pre-trained models.

Findings

01 Improves robustness to misspellings and inflections
02 Enhances performance on social media code-switching tasks
03 Can be integrated without retraining transformer parameters

Abstract

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement of the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and…
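The idea of composing a subword vector from its characters, and exposing it through the same interface as a lookup table, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the architecture from the paper: the names `compose_subword`, `LookupTable`, and `CharComposedTable` are hypothetical, the character vectors are fixed one-hot vectors rather than learned embeddings, and mean-pooling stands in for whatever character encoder the module actually uses.

```python
import math
import string

# Hypothetical stand-in for learned character embeddings: one-hot vectors
# over lowercase ASCII. A trained module would use dense, learned vectors.
ALPHABET = string.ascii_lowercase
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def compose_subword(subword):
    """Build a subword vector by mean-pooling its character vectors.

    Characters outside ALPHABET contribute nothing (a toy assumption).
    """
    vec = [0.0] * len(ALPHABET)
    for ch in subword:
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            vec[idx] += 1.0
    n = max(len(subword), 1)
    return [v / n for v in vec]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LookupTable:
    """Fixed subword table: strings outside the vocabulary have no row."""

    def __init__(self, vocab):
        # Rows are filled from the composer only to keep the toy self-contained.
        self.rows = {sw: compose_subword(sw) for sw in vocab}

    def embed(self, subword):
        return self.rows.get(subword)  # None when out of vocabulary


class CharComposedTable:
    """Same embed() interface, but computed from characters on demand."""

    def embed(self, subword):
        return compose_subword(subword)


if __name__ == "__main__":
    table = LookupTable(["running", "zebra"])
    composed = CharComposedTable()
    print(table.embed("runnning"))  # the misspelling has no table row
    print(cosine(composed.embed("running"), composed.embed("runnning")))
    print(cosine(composed.embed("running"), composed.embed("zebra")))
```

Because both classes expose the same `embed` method, the character-based one can be swapped in wherever the table was used, which is the sense of "drop-in replacement" in this sketch. Note that mean-pooling ignores character order, so this toy cannot distinguish anagrams; it only shows why a misspelling stays close to its intended form while a fixed table returns nothing for it.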

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Linear Layer · mBERT · Layer Normalization · Softmax · Adam · Dense Connections · Dropout · Linear Warmup With Linear Decay · Attention Dropout