Using n-aksaras to model Sanskrit and Sanskrit-adjacent texts
Charles Li (CNRS, CEIAS)

TL;DR
This paper introduces a simpler tokenization method for Sanskrit and related texts based on n-aksaras, contiguous sequences of aksaras. It reduces the need for sandhi resolution and enables cross-lingual analysis of ancient commentaries.
Contribution
It proposes n-aksaras as the tokenization unit for n-gram analysis of Sanskrit texts, so that raw text can be analyzed with much less sandhi resolution.
Findings
Identified patterns of text reuse across centuries and languages.
Demonstrated applicability to Sanskrit and Sanskrit-adjacent texts.
Provided initial insights into Buddhist commentarial practices.
Abstract
Despite -- or perhaps because of -- their simplicity, n-grams, or contiguous sequences of tokens, have been used with great success in computational linguistics since their introduction in the late 20th century. Recast as k-mers, or contiguous sequences of monomers, they have also found applications in computational biology. When applied to the analysis of texts, n-grams usually take the form of sequences of words. But if we try to apply this model to the analysis of Sanskrit texts, we are faced with the arduous task of, firstly, resolving sandhi to split a phrase into words, and, secondly, splitting long compounds into their components. This paper presents a simpler method of tokenizing a Sanskrit text for n-grams, by using n-aksaras, or contiguous sequences of aksaras. This model reduces the need for sandhi resolution, making it much easier to use on raw text. It is also possible to…
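The idea of n-aksaras described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes input in IAST romanization, treats an aksara as an optional consonant cluster plus one vowel plus an optional anusvara or visarga, and discards whitespace and punctuation so that written word boundaries do not affect the tokens. The names `aksaras` and `n_aksaras` are hypothetical, and edge cases such as vowel hiatus or Vedic characters are not handled.

```python
import re
import unicodedata

# Assumed IAST inventory; diphthongs are listed first so "ai"/"au" match as one vowel.
VOWEL = r"(?:ai|au|[aāiīuūṛṝḷḹeo])"
CONSONANT = r"[kgṅcjñṭḍṇtdnpbmyrlvśṣsh]"  # aspirates are a stop followed by "h"
NON_LETTER = re.compile(r"[^aāiīuūṛṝḷḹeokgṅcjñṭḍṇtdnpbmyrlvśṣshṃḥ]")

# One aksara: consonant cluster + vowel + optional anusvara/visarga,
# or a leftover consonant run with no vowel (e.g. at the end of the text).
AKSARA = re.compile(rf"{CONSONANT}*{VOWEL}[ṃḥ]?|{CONSONANT}+")


def aksaras(text):
    """Split IAST text into aksaras, ignoring spaces and punctuation."""
    text = unicodedata.normalize("NFC", text.lower())
    letters = NON_LETTER.sub("", text)
    return AKSARA.findall(letters)


def n_aksaras(text, n):
    """Return all contiguous sequences of n aksaras as tuples."""
    units = aksaras(text)
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


if __name__ == "__main__":
    print(aksaras("tat tvam asi"))
    print(n_aksaras("tat tvam asi", 2))
```

Under these assumptions, `aksaras("tat tvam asi")` and `aksaras("tattvamasi")` both give `['ta', 'ttva', 'ma', 'si']`, so the tokens do not depend on where an edition places its word breaks. Sandhi that changes the letters themselves would still alter the affected aksaras.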
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Language and cultural evolution
Methods: Test
