TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer
Jivnesh Sandhan, Rathin Singha, Narein Rao, Suvendu Samanta, Laxmidhar Behera, Pawan Goyal

TL;DR
TransLIST is a transformer-based Sanskrit tokenizer that handles sandhi phenomena and out-of-vocabulary tokens, outperforming the previous state of the art by 7.2 points on the perfect match metric across benchmark datasets.
Contribution
It introduces a linguistically informed transformer model with soft-masked attention and path ranking for improved Sanskrit word segmentation.
Findings
Outperforms the state of the art by 7.2 points on the perfect match metric
Handles out-of-vocabulary tokens effectively
Incorporates sandhi-specific encoding and novel attention mechanisms
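The perfect match figure above can be made concrete with a small sketch. This page does not define the metric, so the code assumes it is the percentage of sentences whose predicted segmentation equals the gold segmentation exactly. The function name `perfect_match` and the sample data are illustrative only and are not taken from the paper's code.

```python
from typing import List, Sequence


def perfect_match(predictions: Sequence[List[str]],
                  references: Sequence[List[str]]) -> float:
    """Percentage of sentences whose predicted word list equals the gold list.

    Assumption: a sentence counts as a hit only if every word matches,
    in order. This is an illustrative definition, not the paper's script.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return 100.0 * hits / len(references)


# Two toy sentences in ASCII transliteration: the first is segmented
# correctly, the second has one wrong word, so it does not count.
predicted = [["na", "iti"], ["tat", "eva"]]
gold = [["na", "iti"], ["tad", "eva"]]
score = perfect_match(predicted, gold)
```

Under this definition a single wrong word makes the whole sentence a miss, which is why the metric is much stricter than word-level precision or recall.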
Abstract
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon, which modifies the characters at word boundaries and needs special treatment. Existing lexicon-driven approaches for SWS use the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail when they encounter out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot exploit latent word information even when it is available. To mitigate the shortcomings of both families of approaches, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST)…
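To illustrate why sandhi makes segmentation non-trivial, here is a toy sketch. It assumes an SLP1-style ASCII transliteration (where `A` stands for long ā) and covers only a handful of vowel sandhi rules. The rule table and both function names are hypothetical and far simpler than what the Sanskrit Heritage Reader or TransLIST actually do.

```python
# Toy vowel-sandhi table: (final char of left word, initial char of right
# word) -> merged char. A tiny illustrative subset, not a full rule set.
VOWEL_SANDHI = {
    ("a", "a"): "A",
    ("a", "A"): "A",
    ("A", "a"): "A",
    ("A", "A"): "A",
    ("a", "i"): "e",
    ("a", "u"): "o",
}


def join_with_sandhi(left: str, right: str) -> str:
    """Join two words, merging the boundary characters if a toy rule applies."""
    key = (left[-1], right[0])
    if key in VOWEL_SANDHI:
        return left[:-1] + VOWEL_SANDHI[key] + right[1:]
    return left + right


def candidate_splits(joined: str) -> list:
    """List every (left, right) pair the toy rules could have merged."""
    candidates = []
    for i, ch in enumerate(joined):
        for (left_end, right_start), merged in VOWEL_SANDHI.items():
            if ch == merged:
                candidates.append((joined[:i] + left_end,
                                   right_start + joined[i + 1:]))
    return candidates


joined = join_with_sandhi("vidyA", "AlayaH")   # boundary A + A merges to A
splits = candidate_splits(joined)              # several possible origins
```

Even with this tiny rule table, one merged character maps back to several possible word pairs, so a segmenter has to rank candidates instead of simply cutting the string at a boundary.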
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding
