TL;DR
CANINE is a character-level neural encoder that needs no explicit tokenization or fixed vocabulary. It downsamples the character sequence before a deep transformer stack to keep computation efficient, and it outperforms a comparable mBERT model on the multilingual TyDi QA benchmark.
Contribution
Introduces CANINE, a tokenization-free encoder that operates directly on character sequences, together with a pre-training strategy that works on characters or optionally uses subwords as a soft inductive bias.
Findings
Outperforms mBERT by 2.8 F1 on TyDi QA
Uses 28% fewer parameters than the comparable mBERT model
Operates without explicit tokenization or fixed vocabularies
Abstract
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a…
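The two ideas named in the abstract, vocabulary-free character input and downsampling, can be illustrated with a minimal Python sketch. It is not the paper's implementation: `toy_embed` is a hypothetical stand-in for a learned character embedding, mean pooling stands in for whatever learned downsampling layer the model uses, and the rate of 4 is chosen only for illustration.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode codepoint; no vocabulary lookup is needed."""
    return [ord(ch) for ch in text]


def toy_embed(codepoint: int, dim: int = 4) -> List[float]:
    """Hypothetical deterministic embedding, standing in for a learned one."""
    return [((codepoint * (d + 1)) % 97) / 97.0 for d in range(dim)]


def downsample(vectors: List[List[float]], rate: int) -> List[List[float]]:
    """Mean-pool each consecutive group of `rate` vectors into one vector.

    This is an assumed, simplified substitute for a learned downsampling layer;
    it only demonstrates how the sequence length shrinks by roughly `rate`.
    """
    pooled = []
    for start in range(0, len(vectors), rate):
        window = vectors[start:start + rate]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


text = "tokenization-free"
rate = 4  # illustrative value, not taken from the text above
codepoints = to_codepoints(text)
char_vectors = [toy_embed(cp) for cp in codepoints]
downsampled = downsample(char_vectors, rate)

# The deep transformer stack would then run over the shorter sequence.
print(len(codepoints), "characters ->", len(downsampled), "downsampled positions")
```

Because the input is just a list of codepoints, any string in any script can be encoded without an out-of-vocabulary case; the cost is a longer sequence, which is what the downsampling step is meant to offset.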
Taxonomy
Methods: Attention Is All You Need · Linear Layer · mBERT · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection · Label Smoothing · Dropout · Multi-Head Attention
