Comparative analysis of subword tokenization approaches for Indian languages
Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra

TL;DR
This study compares subword tokenization methods (SentencePiece, BPE, and WordPiece) for machine translation of Indian languages and shows how the choice of tokenizer affects translation quality across different model types.
Contribution
The paper analyzes how different subword tokenization techniques influence translation performance for Indian languages across multiple MT frameworks.
Findings
SentencePiece outperforms the other tokenizers in statistical and neural MT models.
BPE yields better results in multilingual neural MT.
Translations from Indian languages (ILs) into English are generally more accurate than translations from English into ILs.
Abstract
Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece…
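To make the idea of subword units concrete, here is a minimal sketch of how BPE-style merges can be learned and applied. It is a toy illustration, not the paper's pipeline: the function names (learn_bpe, merge_pair, segment), the end-of-word marker, and the romanized toy corpus are all assumptions introduced for this example, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (a common BPE convention; assumed here)


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Start from characters: each word is a tuple of symbols plus the end marker.
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Merge the most frequent pair (ties: first pair encountered).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of romanized word forms sharing a stem.
corpus = ["ladka", "ladki", "ladke", "ladkon"]
merges = learn_bpe(corpus, num_merges=4)
print(segment("ladkiyan", merges))
```

Because the shared stem is merged first, an unseen inflected form is split into a known stem-like unit plus smaller pieces rather than being treated as a single unknown word, which is the property that makes subword tokenization attractive for morphologically rich languages.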
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
Methods: WordPiece · SentencePiece · Byte Pair Encoding
