CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Jonathan H. Clark; Dan Garrette; Iulia Turc; John Wieting

arXiv:2103.06874 · cs.CL · May 19, 2022

TL;DR

CANINE is a character-level neural encoder that needs no explicit tokenization step or fixed vocabulary. It pairs downsampling with a deep transformer stack to process character sequences efficiently, and it outperforms mBERT by 2.8 F1 on the multilingual TyDi QA benchmark.

Contribution

Introduces CANINE, a tokenization-free encoder that operates directly on character sequences, together with a pre-training strategy that works on characters alone or optionally uses subwords as a soft inductive bias.

Findings

01

Outperforms mBERT by 2.8 F1 on TyDi QA

02

Uses 28% fewer parameters than a comparable mBERT model

03

Operates without explicit tokenization or fixed vocabularies
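
To make "without explicit tokenization or fixed vocabularies" concrete, here is a minimal sketch that assumes each character is represented by its Unicode code point, so any string maps to integer IDs without a learned lexicon. The helper names `encode` and `decode` are hypothetical and are not taken from the authors' code.

```python
from typing import List


def encode(text: str) -> List[int]:
    """Map each character to its Unicode code point (no vocabulary lookup)."""
    return [ord(ch) for ch in text]


def decode(ids: List[int]) -> str:
    """Invert encode(): turn code points back into a string."""
    return "".join(chr(i) for i in ids)


ids = encode("h\u00e9llo")  # "héllo", written with an escape for the accented letter
print(ids)          # [104, 233, 108, 108, 111]
print(decode(ids))  # héllo
```

Because every character already has a code point, there is no out-of-vocabulary case to handle. The cost is a longer input sequence than a subword model would see.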

Abstract

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a…
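
The downsampling idea in the abstract can be illustrated with a small sketch. It uses mean pooling over fixed windows as a simple stand-in. It is not the authors' downsampling layer, and the sequence length of 2048 and rate of 4 are illustrative assumptions. The function name `downsample` is hypothetical.

```python
from typing import List

Vector = List[float]


def downsample(char_vectors: List[Vector], rate: int) -> List[Vector]:
    """Average every `rate` consecutive character vectors into one position.

    Mean pooling is an assumed stand-in for whatever learned downsampling
    a real model would use; it only demonstrates the length reduction.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(char_vectors), rate):
        window = char_vectors[start:start + rate]  # last window may be shorter
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


chars = [[float(i)] for i in range(2048)]  # 2048 one-dimensional "embeddings"
short = downsample(chars, rate=4)
print(len(chars), "->", len(short))        # 2048 -> 512
```

Since self-attention cost grows with the square of the sequence length, cutting the length by a factor of 4 before the deep transformer stack reduces that cost by roughly a factor of 16 in this toy setting.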

Taxonomy

Methods: Attention Is All You Need · Linear Layer · mBERT · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing · Dropout · Multi-Head Attention