Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers
Odunayo Ogundepo, Xinyu Zhang, and Jimmy Lin

TL;DR
This paper introduces a language-agnostic tokenization method for information retrieval based on WordPiece. It outperforms whitespace tokenization in most of the languages tested and further improves retrieval effectiveness when combined with existing language-specific tokenizers.
Contribution
Proposes using WordPiece tokenization for lexical retrieval in multiple languages, reducing reliance on language-specific tokenizers and improving retrieval effectiveness.
Findings
The WordPiece tokenizer outperforms whitespace tokenization in most languages.
Combining WordPiece with a custom, language-specific tokenizer improves retrieval effectiveness further.
The mBERT tokenizer provides strong relevance signals across diverse languages.
Abstract
Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse…
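To make the contrast in the abstract concrete, the following is a minimal sketch of BM25 scoring with a pluggable tokenizer. It compares whitespace tokens against subword pieces from a greedy longest-match WordPiece segmenter. It is not the authors' implementation: the toy vocabulary, the function names, the two example documents, and the BM25 parameters (k1 = 0.9, b = 0.4) are all assumptions chosen for illustration. In practice a WordPiece vocabulary would be learned from a corpus rather than written by hand.

```python
import math
from collections import Counter


def whitespace_tokenize(text):
    """Baseline: lowercase and split on whitespace."""
    return text.lower().split()


def wordpiece_tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of each word.

    `vocab` is a hypothetical hand-written vocabulary; continuation
    pieces carry the conventional "##" prefix.
    """
    pieces = []
    for word in text.lower().split():
        start, word_pieces = 0, []
        while start < len(word):
            end, match = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:  # no piece fits: the whole word is unknown
                word_pieces = [unk]
                break
            word_pieces.append(match)
            start = end
        pieces.extend(word_pieces)
    return pieces


def bm25_scores(query, docs, tokenize, k1=0.9, b=0.4):
    """Score every document against the query with BM25.

    k1 and b are illustrative values, not settings taken from the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy data (hypothetical): the query and the first document share a stem
# but no whole word.
vocab = {"token", "##izer", "##ization", "##s", "index", "##ed", "search", "##ing"}
docs = ["tokenizers indexed", "searching"]
query = "tokenization"

print(bm25_scores(query, docs, whitespace_tokenize))
print(bm25_scores(query, docs, lambda t: wordpiece_tokenize(t, vocab)))
```

With whitespace tokens the query term "tokenization" appears in neither document, so both scores are zero. With subword pieces the query and the first document share the piece "token", so only that document receives a positive score. This is the kind of partial lexical match that whitespace tokenization cannot capture.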
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management
Methods: Test · WordPiece · mBERT
