LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages
Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, and Preethi Jyothi

TL;DR
This paper uses Sanskrit-based segments, produced by a character-level Transformer for Sanskrit word segmentation (CharSS), to improve technical term translation in low-resource Indian languages. It reports average chrF++ improvements of 8.46 and 6.79 in two experimental settings.
Contribution
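It proposes a novel use of Sanskrit-based segments for linguistically informed translation of technical terms into lexically similar Indian languages, with segmentation handled by a character-level Transformer (CharSS) that accounts for sandhi and morpho-phonemic changes.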
Findings
Average chrF++ improvements of 8.46 and 6.79 in the two experimental settings
Consistent translation quality gains across both settings
Positive human evaluation results
Abstract
Translating technical terms into lexically similar, low-resource Indian languages remains a challenge due to limited parallel data and the complexity of linguistic structures. We propose a novel use-case of Sanskrit-based segments for linguistically informed translation of such terms, leveraging subword-level similarity and morphological alignment across related languages. Our approach uses character-level segmentation to identify meaningful subword units, facilitating more accurate and context-aware translation. To enable this, we utilize a Character-level Transformer model for Sanskrit Word Segmentation (CharSS), which addresses the complexities of sandhi and morpho-phonemic changes during segmentation. We observe consistent improvements in two experimental settings for technical term translation using Sanskrit-derived segments, averaging 8.46 and 6.79 chrF++ scores, respectively.…
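Since the reported gains are stated in chrF++, a rough illustration of what that metric measures may help. The sketch below is a simplified sentence-level version, not the scoring script used in the paper: it averages precision and recall over character n-grams (orders 1 to 6) and word n-grams (orders 1 to 2) and combines them into an F-score with beta = 2. The function names `ngrams` and `chrf_pp`, the removal of spaces before extracting character n-grams, and the skipping of orders longer than either string are all assumptions made for this example; reference implementations differ in such details and will give somewhat different numbers.

```python
from collections import Counter


def ngrams(seq, n):
    """Count all contiguous n-grams of length n in a sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def chrf_pp(hyp, ref, char_order=6, word_order=2, beta=2.0):
    """Simplified sentence-level chrF++ on a 0-100 scale (illustrative only)."""
    # Two streams: characters (spaces dropped, an assumption) and words.
    streams = [
        (list(hyp.replace(" ", "")), list(ref.replace(" ", "")), char_order),
        (hyp.split(), ref.split(), word_order),
    ]
    precisions, recalls = [], []
    for hyp_seq, ref_seq, max_n in streams:
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp_seq, n), ngrams(ref_seq, n)
            if not h or not r:
                continue  # order is longer than one of the strings
            overlap = sum((h & r).values())  # clipped n-gram matches
            precisions.append(overlap / sum(h.values()))
            recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta with beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Under these assumptions, a hypothesis identical to its reference scores 100 and one sharing no characters with it scores 0. Because most of the signal comes from character n-grams, a translation that gets part of a compound term right still earns partial credit, which is one reason character-level metrics are commonly used for morphologically rich languages.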
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Lexicography and Language Studies
Methods: Linear Layer · Multi-Head Attention · Attention Is All You Need · Softmax · Byte Pair Encoding · Layer Normalization · Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam
