Multilingual Search with Subword TF-IDF

Artit Wangperawong

arXiv:2209.14281·cs.CL·September 30, 2022

TL;DR

Subword TF-IDF (STF-IDF) improves multilingual search accuracy without manually curated heuristics: subword tokenization makes the approach inherently multilingual, as demonstrated on XQuAD.

Contribution

The paper introduces STF-IDF, which replaces the manually curated tokenization, stop words, and stemming rules of traditional TF-IDF with subword tokenization. Multilingual support is incorporated as part of training the subword tokenization model.

Findings

01 Achieves 85.4% accuracy for English search

02 Over 80% accuracy across 10 other languages

03 Open-sourced implementation available

Abstract

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of Text2Text: https://github.com/artitw/text2text
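The idea in the abstract can be illustrated with a minimal sketch, which is not the Text2Text implementation. The paper relies on a trained subword tokenization model; as a labeled stand-in, the sketch below uses overlapping character trigrams, which also need no stop words or stemming rules. All function names, the sample documents, and the smoothed IDF formula are assumptions made for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in tokenizer: overlapping character n-grams.

    Assumption: the paper uses a trained subword model; character
    n-grams are swapped in here only to keep the sketch dependency-free.
    """
    s = text.lower()
    if len(s) <= n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and unit TF-IDF vectors per doc."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    # Smoothed IDF (an assumed variant, not taken from the paper).
    idf = {t: math.log((1 + num_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Return (best_doc_index, scores) ranked by cosine similarity."""
    counts = Counter(subword_tokens(query, n))
    # Tokens never seen at indexing time carry no weight.
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * doc.get(t, 0.0) for t, w in q.items())
              for doc in vectors]
    best = max(range(len(vectors)), key=scores.__getitem__)
    return best, scores


# Hypothetical mixed-language corpus, indexed without language-specific rules.
DOCS = [
    "The cat sat on the mat.",
    "Der Hund läuft im Park.",
    "El gato duerme en la casa.",
]
idf, vectors = build_index(DOCS)
```

Calling `search("Where is the cat?", idf, vectors)` ranks the English document first, and `search("gato", idf, vectors)` ranks the Spanish one first, with the same code path for both languages. A trained subword vocabulary would presumably give more meaningful units than fixed trigrams, which is the part this sketch does not reproduce.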

Code & Models

Repositories

artitw/text2text
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Information Retrieval and Search Behavior