One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages

Vít Novotný, Eniafe Festus Ayetiran, Dalibor Bačovský, Dávid Lupták, Michal Štefánik, and Petr Sojka (Faculty of Informatics, Masaryk University)

arXiv:2102.02585 · cs.CL · September 21, 2021

TL;DR

This paper shows that tuning the subword sizes of fastText models for each language improves word analogy accuracy, and it proposes a simple n-gram coverage model that predicts better-than-default subword sizes without expensive parameter tuning.

Contribution

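The paper identifies the optimal subword sizes on word analogy tasks for nine languages (English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian) and introduces a simple n-gram coverage model that predicts effective subword sizes, reducing the need for costly parameter tuning.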

Findings

01. Optimized subword sizes improve word analogy accuracy by up to 14%.

02. The n-gram coverage model predicts subword sizes that score within 1% of the optimal word analogy accuracy.

03. The predicted subword sizes outperform the defaults on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks.

Abstract

Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks. In our work, we find the optimal subword sizes on the English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We then propose a simple n-gram coverage model and we show that it predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We show that the optimization of fastText's…
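
To make the term "subword sizes" concrete, the sketch below lists the character n-grams a fastText-style model would associate with a word for a given size range. It is an illustration, not the authors' code: the function name `char_ngrams` is hypothetical, the boundary markers `<` and `>` and the default range of 3 to 6 follow the commonly documented fastText behavior, and the sketch ignores details of the real implementation such as n-gram hashing and the separate entry for the whole word.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List character n-grams of sizes minn..maxn for a boundary-marked word.

    Hypothetical helper for illustration; minn and maxn mirror the names
    of the fastText subword-size parameters.
    """
    marked = f"<{word}>"  # mark the word boundaries, as fastText does
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


# Trigrams only: a narrow size range yields few, short subwords.
print(char_ngrams("where", 3, 3))
# The 3-6 range yields more and longer subwords for the same word.
print(len(char_ngrams("where", 3, 6)))
```

Changing `minn` and `maxn` changes which subwords a word shares with other words, which is the parameter choice the paper optimizes per language.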

Taxonomy

Methods: fastText