Sanskrit Sandhi Splitting using seq2(seq)^2
Rahul Aralikatte, Neelamadhav Gantayat, Naveen Panwar, Anush Sankaran, Senthil Mani

TL;DR
This paper introduces a novel deep learning model, DD-RNN, for Sanskrit Sandhi splitting that accurately predicts split locations and constituent words, outperforming existing methods and demonstrating cross-lingual generalization to Chinese segmentation.
Contribution
The paper presents the DD-RNN architecture, achieving high accuracy in Sanskrit Sandhi splitting and showcasing its effectiveness in Chinese word segmentation tasks.
Findings
Split location prediction accuracy of 95%
Constituent word prediction accuracy of 79.5%
Outperforms state-of-the-art methods by 20%
Abstract
In Sanskrit, small words (morphemes) are combined to form compound words through a process known as Sandhi. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although the language has rules governing word splitting, it is highly challenging to identify the location of the splits in a compound word. Existing Sandhi splitting systems incorporate these pre-defined splitting rules, but their accuracy is low because the same compound word can be broken down in multiple ways that each yield a syntactically correct split. In this research, we propose a novel deep learning architecture called Double Decoder RNN (DD-RNN), which (i) predicts the location of the split(s) with 95% accuracy, and (ii) predicts the constituent words (learning the Sandhi splitting rules) with 79.5% accuracy, outperforming the state of the art by 20%. Additionally, we…
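To make the ambiguity described in the abstract concrete, the sketch below enumerates candidate splits of a compound using a tiny hand-written table of three vowel Sandhi rules in IAST transliteration. It is an illustration of the rule-based setting only, not the paper's DD-RNN; the names `REVERSE_RULES` and `candidate_splits` are hypothetical, and the rule table is deliberately incomplete.

```python
# Toy illustration only: three vowel Sandhi rules, written in reverse.
# This is NOT the paper's DD-RNN model, and the table is far from complete.
# Key: merged vowel seen in the compound.
# Value: possible (end of left word, start of right word) pairs.
REVERSE_RULES = {
    "ā": [("a", "a"), ("a", "ā"), ("ā", "a"), ("ā", "ā")],
    "e": [("a", "i"), ("a", "ī"), ("ā", "i"), ("ā", "ī")],
    "ai": [("a", "e"), ("a", "ai"), ("ā", "e"), ("ā", "ai")],
}


def candidate_splits(compound):
    """Return every (position, left, right) split the toy rules permit."""
    candidates = []
    for pos in range(1, len(compound)):
        for merged, pairs in REVERSE_RULES.items():
            if not compound.startswith(merged, pos):
                continue
            prefix = compound[:pos]
            suffix = compound[pos + len(merged):]
            if not suffix:
                continue  # a split needs a non-empty right-hand word
            for left_end, right_start in pairs:
                candidates.append((pos, prefix + left_end, right_start + suffix))
    return candidates


if __name__ == "__main__":
    # vidyā + ālaya -> vidyālaya is the intended reading; the rules alone
    # cannot tell it apart from the other candidates at the same position.
    for pos, left, right in candidate_splits("vidyālaya"):
        print(pos, left, "+", right)
```

Even with a single split position, the rules yield several rule-consistent word pairs. Choosing the position and then recovering the correct constituents are the two sub-tasks for which the abstract reports separate accuracies.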
