Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information

Yuval Pinter; Amanda Stent; Mark Dredze; Jacob Eisenstein

arXiv:2108.00391 · cs.CL · August 3, 2021 · 5 cites

Open Access

TL;DR

This paper introduces XRayEmb, a method to enhance existing token-based transformer models with character-level information, improving performance on various NLP tasks, especially with non-standard English text.

Contribution

XRayEmb retrofits pre-trained token models with character-level encodings, enabling better handling of domain shifts and novel words without retraining from scratch.

Findings

01. Improves performance on both autoregressive and masked transformer models

02. Improves results on both sequence-level and sequence tagging tasks

03. Especially effective on non-standard English text

Abstract

Commonly-used transformer language models depend on a tokenization schema which sets an unchangeable subword vocabulary prior to pre-training, destined to be applied to all downstream tasks regardless of domain shift, novel word formations, or other sources of vocabulary mismatch. Recent work has shown that "token-free" models can be trained directly on characters or bytes, but training these models from scratch requires substantial computational resources, and this implies discarding the many domain-specific models that were trained on tokens. In this paper, we present XRayEmb, a method for retrofitting existing token-based models with character-level information. XRayEmb is composed of a character-level "encoder" that computes vector representations of character sequences, and a generative component that decodes from the internal representation to a character sequence. We show that…
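The retrofitting idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the vocabulary, the embedding width `DIM`, the helper names, and the mean-pooling character encoder are all assumptions made for illustration, and the generative (decoding) component is left out. It only shows how a word outside a fixed token vocabulary could receive a vector computed from its characters and sit in the same sequence as ordinary token embeddings.

```python
import random

DIM = 8  # toy embedding width (assumption, not from the paper)

def make_table(symbols, seed):
    """Random vectors standing in for learned embeddings."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1, 1) for _ in range(DIM)] for s in symbols}

# Hypothetical fixed vocabulary with its "pre-trained" token embeddings.
TOKEN_EMB = make_table(["the", "cat", "sat", "[UNK]"], seed=0)
# Character embeddings used by the stand-in character encoder.
CHAR_EMB = make_table("abcdefghijklmnopqrstuvwxyz", seed=1)

def encode_chars(word):
    """Stand-in character encoder: mean of the word's character vectors.
    A trained sequence model would take this role in practice."""
    vecs = [CHAR_EMB[c] for c in word.lower() if c in CHAR_EMB]
    if not vecs:
        # No known characters: fall back to the unknown-token vector.
        return list(TOKEN_EMB["[UNK]"])
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embed_sequence(words):
    """Return one (source, vector) pair per word: the token embedding when
    the word is in the vocabulary, otherwise a character-derived vector."""
    out = []
    for w in words:
        if w in TOKEN_EMB:
            out.append(("token", TOKEN_EMB[w]))
        else:
            out.append(("chars", encode_chars(w)))
    return out

seq = embed_sequence(["the", "caaat", "sat"])
print([source for source, _ in seq])  # ['token', 'chars', 'token']
```

Here the non-standard spelling "caaat" is absent from the toy vocabulary, so its vector comes from the character encoder, while "the" and "sat" keep their token embeddings. Both kinds of vector share the width `DIM`, which is what would let a downstream model consume them in a single sequence.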

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification