Using n-aksaras to model Sanskrit and Sanskrit-adjacent texts

Charles Li (CNRS; CEIAS)

arXiv:2301.12969 · cs.CL · January 31, 2023

Open Access

TL;DR

This paper introduces a simpler way to tokenize Sanskrit and Sanskrit-adjacent texts for n-gram analysis: n-aksaras, or contiguous sequences of aksaras. The method reduces the need for sandhi resolution, so it can be applied to raw text, and it enables cross-lingual analysis of ancient commentaries.

Contribution

The paper proposes n-aksaras as the token unit for n-gram models of Sanskrit texts, so that a text can be analyzed without extensive sandhi resolution beforehand.

Findings

01. Identified patterns of text reuse across centuries and languages.

02. Demonstrated applicability to Sanskrit and Sanskrit-adjacent texts.

03. Provided initial insights into Buddhist commentarial practices.

Abstract

Despite -- or perhaps because of -- their simplicity, n-grams, or contiguous sequences of tokens, have been used with great success in computational linguistics since their introduction in the late 20th century. Recast as k-mers, or contiguous sequences of monomers, they have also found applications in computational biology. When applied to the analysis of texts, n-grams usually take the form of sequences of words. But if we try to apply this model to the analysis of Sanskrit texts, we are faced with the arduous task of, firstly, resolving sandhi to split a phrase into words, and, secondly, splitting long compounds into their components. This paper presents a simpler method of tokenizing a Sanskrit text for n-grams, by using n-aksaras, or contiguous sequences of aksaras. This model reduces the need for sandhi resolution, making it much easier to use on raw text. It is also possible to…
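The idea of tokenizing by aksara rather than by word can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the input is in IAST transliteration, treats an aksara as zero or more consonants followed by one vowel and an optional anusvara or visarga, and lets consonant clusters run across spaces. The names `split_aksaras` and `n_aksaras` are hypothetical, chosen for this example.

```python
import re
import unicodedata

# Assumed IAST inventory (an illustration, not taken from the paper).
CONSONANT = "kgṅcjñṭḍṇtdnpbmyrlvśṣsh"   # "h" also covers aspirates such as "dh"
VOWEL = "ai|au|[aāiīuūṛṝḷḹeo]"          # digraphs are tried before single vowels
MODIFIER = "[ṃṁḥ]"                       # anusvara / visarga

# An aksara: consonants (possibly spanning whitespace) + vowel + optional modifier.
# Consonants that are never followed by a vowel are emitted as a token of their own.
AKSARA_RE = re.compile(
    rf"(?:[{CONSONANT}]|\s)*(?:{VOWEL}){MODIFIER}?|[{CONSONANT}]+"
)

def split_aksaras(text):
    """Split IAST text into aksaras, ignoring word boundaries."""
    text = unicodedata.normalize("NFC", text).lower()
    return [re.sub(r"\s+", "", m) for m in AKSARA_RE.findall(text)]

def n_aksaras(aksaras, n):
    """Return every contiguous run of n aksaras as a tuple."""
    return [tuple(aksaras[i:i + n]) for i in range(len(aksaras) - n + 1)]

if __name__ == "__main__":
    tokens = split_aksaras("dharmakṣetre kurukṣetre")
    print(tokens)
    print(n_aksaras(tokens, 2))
```

Because word boundaries are ignored, `split_aksaras("tad eva")` and `split_aksaras("tadeva")` yield the same tokens, which is one way to see why this tokenization depends less on word segmentation. A vowel hiatus written without a space (an "a" directly followed by "i") would be read as the diphthong "ai" in this sketch; handling that case would need extra conventions.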

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Language and cultural evolution
