MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

Orevaoghene Ahia; Sachin Kumar; Hila Gonen; Valentin Hofmann; Tomasz Limisiewicz; Yulia Tsvetkov; Noah A. Smith

arXiv:2407.08818·cs.CL·November 19, 2024·1 cites

TL;DR

MAGNET introduces an adaptive, gradient-based tokenization method for multilingual language models, reducing over-segmentation in non-Latin scripts and enhancing efficiency and utility across diverse languages.

Contribution

It proposes a modular, language-specific tokenization approach that improves segmentation fairness and model performance in multilingual settings.

Findings

01 Reduces over-segmentation in non-Latin scripts

02 Enables faster language modeling

03 Improves downstream task utility

Abstract

In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost. Specifically, previous studies have reported multiple modeling biases that current tokenization algorithms introduce for languages written in non-Latin scripts, the main one being over-segmentation. In this work, we propose MAGNET (multilingual adaptive gradient-based tokenization), which reduces over-segmentation via adaptive gradient-based subword tokenization. MAGNET learns to predict segment boundaries between byte tokens in a sequence via sub-modules within the model, which act as internal boundary predictors (tokenizers). Previous gradient-based tokenization methods aimed for uniform compression across sequences by integrating a single boundary predictor during training and optimizing it end-to-end through stochastic reparameterization…
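To make the idea of a boundary predictor over byte tokens concrete, here is a minimal sketch rather than the authors' implementation. It assumes the predictor has already produced one boundary probability per byte, and it assumes a relaxed Bernoulli (logistic-noise) sample as the form of stochastic reparameterization, which the abstract does not specify. The names `relaxed_bernoulli` and `segment_bytes`, the temperature value, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import math
import random

P_EPS = 1e-9  # clip for probabilities (assumed value, keeps logits finite)
U_EPS = 1e-6  # clip for uniform noise (assumed value)


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli(p, temperature=0.5, rng=random):
    """Draw a relaxed Bernoulli sample in (0, 1) for boundary probability p.

    Hypothetical helper: adds logistic noise to the logit of p, then
    squashes with a temperature-scaled sigmoid.
    """
    p = min(max(p, P_EPS), 1.0 - P_EPS)
    u = min(max(rng.random(), U_EPS), 1.0 - U_EPS)
    logit_p = math.log(p) - math.log(1.0 - p)
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit_p + noise) / temperature)


def segment_bytes(data, boundary_probs, temperature=0.5, rng=random):
    """Split `data` after every byte whose sampled boundary exceeds 0.5."""
    if len(boundary_probs) != len(data):
        raise ValueError("need exactly one boundary probability per byte")
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if relaxed_bernoulli(p, temperature, rng) > 0.5:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing bytes with no closing boundary
    return segments


if __name__ == "__main__":
    text = "नमस्ते".encode("utf-8")      # Devanagari: 3 bytes per code point
    probs = [0.0, 0.0, 1.0] * (len(text) // 3)  # toy predictor output
    print(segment_bytes(text, probs, rng=random.Random(0)))
```

The hard threshold is used here only to show how sampled boundaries turn a byte sequence into variable-length segments. Training such a predictor end-to-end would additionally need a way to pass gradients through the discrete decision, which this sketch leaves out.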

Videos

MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization (SlidesLive)

Taxonomy

Topics: Topic Modeling