Multilingual Search with Subword TF-IDF
Artit Wangperawong

TL;DR
Subword TF-IDF (STF-IDF) improves multilingual search accuracy by replacing manually curated heuristics (tokenization rules, stop words, stemming) with subword tokenization, which supports multiple languages inherently. The gains are demonstrated on XQuAD.
Contribution
The paper introduces STF-IDF, which replaces the heuristics-based preprocessing of traditional TF-IDF with subword tokenization. It needs no manually curated rules, and multilingual support is incorporated during training of the subword tokenization model.
Findings
85.4% information retrieval accuracy for English on XQuAD
Over 80% accuracy for 10 other languages, with no heuristics-based preprocessing
Open-source implementation available as part of Text2Text
Abstract
Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as a part of Text2Text: https://github.com/artitw/text2text
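To make the idea concrete, here is a minimal sketch of TF-IDF retrieval over subword units. It is not the Text2Text implementation. As a loudly stated assumption, it stands in overlapping character trigrams for the trained subword tokenization model the abstract describes, and every function name below (subword_tokens, tfidf_vector, build_index, search) is hypothetical. What it illustrates is that no stop-word list, stemmer, or language-specific tokenizer appears anywhere in the pipeline.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in subword tokenizer (assumption): overlapping character n-grams.

    The paper uses a trained subword tokenization model; character n-grams
    are used here only so the sketch runs with the standard library alone.
    """
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vector(counts, idf):
    """Turn subword counts into an L2-normalised sparse TF-IDF vector."""
    vec = {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def build_index(docs, n=3):
    """Compute smoothed IDF weights and one TF-IDF vector per document."""
    doc_counts = [Counter(subword_tokens(doc, n)) for doc in docs]
    doc_freq = Counter()
    for counts in doc_counts:
        doc_freq.update(counts.keys())  # each subword counted once per doc
    num_docs = len(docs)
    idf = {tok: math.log((1 + num_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [tfidf_vector(counts, idf) for counts in doc_counts]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Return the index of the document with the highest cosine similarity."""
    q = tfidf_vector(Counter(subword_tokens(query, n)), idf)
    scores = [sum(w * vec.get(tok, 0.0) for tok, w in q.items())
              for vec in vectors]
    return max(range(len(vectors)), key=scores.__getitem__)


docs = [
    "The cat sat on the mat.",
    "El gato se sentó en la alfombra.",
    "Stock markets fell sharply today.",
]
idf, vectors = build_index(docs)
best = search("gato en la alfombra", idf, vectors)
print(docs[best])
```

The same index serves English and Spanish queries without any per-language configuration. With a trained subword model, the vocabulary would be learned from multilingual text rather than enumerated as fixed-length n-grams.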
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Information Retrieval and Search Behavior
