Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information
Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein

TL;DR
This paper introduces XRayEmb, a method to enhance existing token-based transformer models with character-level information, improving performance on various NLP tasks, especially with non-standard English text.
Contribution
XRayEmb retrofits pre-trained token models with character-level encodings, enabling better handling of domain shifts and novel words without retraining from scratch.
Findings
Improves performance on autoregressive and masked transformer models
Enhances results on both sequence-level and sequence tagging tasks
Especially effective on non-standard English text
Abstract
Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and this implies discarding the many domain-specific models that were trained on tokens. In this paper, we present XRayEmb, a method for retrofitting existing token-based models with character-level information. XRayEmb is composed of a character-level "encoder" that computes vector representations of character sequences, and a generative component that decodes from the internal representation to a character sequence. We show that…
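To make the encoder half of this description concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy encoder that pools fixed per-character vectors, whereas the abstract only says the encoder "computes vector representations of character sequences". The names `DIM`, `char_vector`, `encode_chars`, and `embed` are hypothetical, and the routing rule (use the character-level vector only when a word is missing from the token table) is an illustrative assumption that the abstract does not state. The generative component is omitted.

```python
import random

DIM = 8  # illustrative embedding width; not a value from the paper


def char_vector(ch, dim=DIM):
    """Deterministic pseudo-random vector for one character.

    Stand-in for a learned character embedding (assumption).
    """
    rng = random.Random(ord(ch))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_chars(word, dim=DIM):
    """Toy character-level encoder: position-weighted mean of character vectors.

    Earlier characters get larger weights, so character order affects the result.
    """
    if not word:
        return [0.0] * dim
    total = [0.0] * dim
    weight_sum = 0.0
    for pos, ch in enumerate(word):
        weight = 1.0 / (1 + pos)
        weight_sum += weight
        vec = char_vector(ch, dim)
        for i in range(dim):
            total[i] += weight * vec[i]
    return [x / weight_sum for x in total]


def embed(word, token_table, dim=DIM):
    """Return the token vector if the word is in the table, else a character-level vector.

    This fallback rule is an assumption made for illustration only.
    """
    if word in token_table:
        return token_table[word]
    return encode_chars(word, dim)


if __name__ == "__main__":
    # Hypothetical one-word "pre-trained" token table.
    table = {"the": [0.1] * DIM}
    for w in ["the", "yaaaas"]:
        print(w, embed(w, table))
```

In this sketch a known word keeps its token vector, while an unseen spelling still receives a vector of the same width, so both can sit in one input sequence. A real system would learn the character encoder rather than use fixed random vectors.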
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
