Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages

Antonis Maronikolakis; Philipp Dufter; Hinrich Schütze

arXiv:2109.05772 · cs.CL · September 14, 2021

TL;DR

This paper introduces a measure to evaluate and improve the compatibility of tokenizations across languages in multilingual language models, aiming to enhance semantic representation learning.

Contribution

It proposes a novel compatibility measure for tokenizations across languages, addressing a gap in multilingual model vocabulary design.

Findings

1. The compatibility measure effectively identifies incompatible tokenizations.

2. Using the measure improves multilingual semantic representations.

3. The approach facilitates the creation of compatible multilingual vocabularies.

Abstract

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" (word-level) in English vs. "v i n" (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible -- a desideratum that so far has been neglected in multilingual models.
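
To make the "wine" vs. "v i n" example concrete, the following is a minimal sketch of a granularity mismatch between two tokenizations of a translation pair. It is an illustrative proxy only and is not the compatibility measure proposed in the paper, which the abstract does not define. The function names `granularity_gap` and `mean_granularity_gap` and the hard-coded token lists are hypothetical.

```python
from typing import List, Tuple


def granularity_gap(tokens_a: List[str], tokens_b: List[str]) -> int:
    """Absolute difference in token counts for one translation pair.

    Assumption: a large gap hints that the two languages are segmented
    at different granularities (e.g. word-level vs. character-level).
    """
    return abs(len(tokens_a) - len(tokens_b))


def mean_granularity_gap(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Average gap over a list of aligned (language A, language B) tokenizations."""
    return sum(granularity_gap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tokenizer outputs for the English/French pair in the abstract.
incompatible = (["wine"], ["v", "i", "n"])  # word-level vs. character-level
compatible = (["wine"], ["vin"])            # both word-level

gap_incompatible = granularity_gap(*incompatible)
gap_compatible = granularity_gap(*compatible)
mean_gap = mean_granularity_gap([incompatible, compatible])
```

In this toy setting the incompatible pair differs by two tokens while the compatible pair differs by none. A real comparison would run actual tokenizers over aligned text rather than hand-written token lists.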

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems

Methods: WordPiece