One Size Does Not Fit All: Finding the Optimal Subword Sizes for FastText Models across Languages
Vít Novotný (1), Eniafe Festus Ayetiran (1), Dalibor Bačovský (1), Dávid Lupták (1), Michal Štefánik (1), and Petr Sojka (1) ((1) Faculty of Informatics, Masaryk University)

TL;DR
This paper shows that tuning subword sizes per language improves the word representations learned by fastText models, and proposes a simple n-gram coverage model that selects good subword sizes with minimal tuning.
Contribution
It identifies optimal subword sizes for several languages and introduces a straightforward n-gram coverage model that predicts effective subword sizes, reducing the need for costly parameter tuning.
Findings
Optimized subword sizes improve word analogy accuracy by up to 14%.
The n-gram coverage model predicts subword sizes whose accuracy is within 1% of the optimum.
The coverage model predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks.
Abstract
Unsupervised representation learning of words from large multilingual corpora is useful for downstream tasks such as word sense disambiguation, semantic text similarity, and information retrieval. The representation precision of log-bilinear fastText models is mostly due to their use of subword information. In previous work, the optimization of fastText's subword sizes has not been fully explored, and non-English fastText models were trained using subword sizes optimized for English and German word analogy tasks. In our work, we find the optimal subword sizes on the English, German, Czech, Italian, Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We then propose a simple n-gram coverage model and we show that it predicts better-than-default subword sizes on the Spanish, French, Hindi, Turkish, and Russian word analogy tasks. We show that the optimization of fastText's…
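To make "subword sizes" concrete, here is a minimal Python sketch of how a word can be split into character n-grams between a minimum and a maximum length, with `<` and `>` as word-boundary markers in the style of fastText. The function names `char_ngrams` and `ngram_length_counts` are hypothetical, the extraction is a simplification of what the fastText library does, and the per-length count is only an illustration of n-gram frequency analysis, not the paper's coverage model.

```python
from collections import Counter


def char_ngrams(word, minn, maxn):
    """Return the character n-grams of `word` with lengths in [minn, maxn].

    The word is wrapped in '<' and '>' boundary markers first, so prefixes
    and suffixes yield n-grams distinct from word-internal ones.
    """
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(token) - n + 1)
    ]


def ngram_length_counts(words, minn, maxn):
    """Count n-gram occurrences per n-gram length over a list of words.

    Illustrative only: a real analysis would run over a large corpus.
    """
    counts = Counter()
    for word in words:
        for gram in char_ngrams(word, minn, maxn):
            counts[len(gram)] += 1
    return counts


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))
    print(ngram_length_counts(["where", "there"], 3, 6))
```

In fastText, the two bounds correspond to the `minn` and `maxn` hyperparameters, whose defaults are 3 and 6. Changing them changes which n-grams each word contributes, which is the parameter choice the paper optimizes per language.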
Taxonomy
Methods: fastText
