A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, Chenlei Guo

TL;DR
This paper introduces a vocabulary-free neural tokenizer that enables end-to-end task learning in multilingual NLP, improving performance especially in low-resource languages and robustness against noise.
Contribution
It presents a novel character-based neural tokenizer that discards fixed vocabularies, allowing for more adaptable and task-specific tokenization across languages.
Findings
Improves multilingual and code-switching task performance
Enhances robustness to typos and misspellings
Outperforms traditional subword tokenizers in low-resource settings
Abstract
Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models' adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from a multilingual corpus, thereby extensively increasing word diversity across languages. Unlike the predefined and fixed vocabularies in subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization.…
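To make the distillation idea in the abstract concrete, the sketch below shows one plausible way to turn a subword tokenizer's segmentation into per-character training targets for a character-based tokenizer. It is an illustration under stated assumptions, not the authors' implementation. The binary "starts a piece" labeling scheme and the names `boundary_labels`, `toy_segmenter`, and `build_training_pairs` are all hypothetical, and `toy_segmenter` is only a stand-in for a real heuristic subword tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def boundary_labels(word: str, pieces: List[str]) -> List[int]:
    """Return one label per character: 1 if it starts a subword piece, else 0.

    Assumption: a binary boundary scheme; the paper's actual label set
    is not given in this excerpt.
    """
    if "".join(pieces) != word:
        raise ValueError("pieces must concatenate to the original word")
    labels: List[int] = []
    for piece in pieces:
        if not piece:  # skip empty pieces so label count matches character count
            continue
        labels.extend([1] + [0] * (len(piece) - 1))
    return labels


def toy_segmenter(word: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: split off a known suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


def build_training_pairs(
    corpus: Iterable[str], segment: Callable[[str], List[str]]
) -> List[Tuple[str, List[int]]]:
    """Collect unique whitespace-separated words and label each one once."""
    unique_words = sorted({w for line in corpus for w in line.split()})
    return [(w, boundary_labels(w, segment(w))) for w in unique_words]


if __name__ == "__main__":
    for word, labels in build_training_pairs(["walking walked", "walking"], toy_segmenter):
        print(word, labels)
```

Working from unique words rather than running text mirrors the abstract's description of pre-training on unique words from a multilingual corpus. In a real setup, the `segment` argument would wrap a trained subword tokenizer, and the labels would supervise a character-level model.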
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
