TransLIST: A Transformer-Based Linguistically Informed Sanskrit Tokenizer

Jivnesh Sandhan; Rathin Singha; Narein Rao; Suvendu Samanta; Laxmidhar Behera; Pawan Goyal

arXiv:2210.11753 · cs.CL · October 24, 2022 · 1 citation

Open Access · 1 Repo

TL;DR

TransLIST is a transformer-based Sanskrit tokenizer that handles the sandhi phenomenon and out-of-vocabulary tokens, outperforming the previous state of the art by 7.2 points on the perfect match metric.

Contribution

It introduces a linguistically informed transformer model with soft-masked attention and path ranking for improved Sanskrit word segmentation.

Findings

01. Outperforms state-of-the-art by 7.2 points in perfect match metric

02. Handles out-of-vocabulary tokens effectively

03. Incorporates sandhi-specific encoding and novel attention mechanisms
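
The perfect match metric named in the first finding is not defined on this page. A minimal sketch, assuming it is the percentage of sentences whose predicted segmentation matches the gold segmentation exactly (the function name and data layout are hypothetical, not taken from the paper or its repository):

```python
from typing import List, Sequence


def perfect_match(predictions: Sequence[List[str]],
                  references: Sequence[List[str]]) -> float:
    """Percentage of sentences whose predicted word list equals the gold word list.

    Assumption: "perfect match" is a sentence-level exact-match score.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A sentence counts only if every predicted word matches, in order.
    exact = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return 100.0 * exact / len(references)


# Two sentences, one segmented exactly right.
score = perfect_match(
    [["na", "iti"], ["ca", "api"]],
    [["na", "iti"], ["ca", "eva"]],
)
```

Under this reading, a 7.2-point gain would mean 7.2 more sentences out of every 100 are segmented entirely correctly.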

Abstract

Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because the sandhi phenomenon modifies the characters at word boundaries and needs special treatment. Existing lexicon-driven approaches for SWS use the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail when they encounter out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot exploit latent word information even when it is available. To mitigate the shortcomings of both families of approaches, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST)…
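
The following toy sketch illustrates the two problems the abstract describes: sandhi fuses characters at word boundaries, and a lexicon-driven candidate generator returns nothing when a word is missing from its lexicon. It is not the TransLIST model or the Sanskrit Heritage Reader. The three vowel rules (in IAST transliteration) are a tiny illustrative subset of sandhi, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (final vowel of left word, initial vowel of right word) -> fused vowel.
# Toy subset only: a+i -> e, a+a -> ā (written as \u0101), a+u -> o.
SANDHI_RULES: Dict[Tuple[str, str], str] = {
    ("a", "i"): "e",
    ("a", "a"): "\u0101",
    ("a", "u"): "o",
}


def join_with_sandhi(left: str, right: str) -> str:
    """Fuse two words, applying the first matching toy rule at the boundary."""
    for (l_end, r_start), fused in SANDHI_RULES.items():
        if left.endswith(l_end) and right.startswith(r_start):
            return left[: -len(l_end)] + fused + right[len(r_start):]
    return left + right


def candidate_splits(surface: str, lexicon: Set[str]) -> List[Tuple[str, str]]:
    """Undo each rule at every position; keep splits whose halves are both in the lexicon."""
    candidates = []
    for i, ch in enumerate(surface):
        for (l_end, r_start), fused in SANDHI_RULES.items():
            if ch == fused:
                left = surface[:i] + l_end
                right = r_start + surface[i + 1:]
                if left in lexicon and right in lexicon:
                    candidates.append((left, right))
    return candidates


fused = join_with_sandhi("na", "iti")                # boundary a+i becomes e
in_vocab = candidate_splits(fused, {"na", "iti"})    # both words known
out_of_vocab = candidate_splits(fused, {"na"})       # "iti" missing from lexicon
```

With both words in the lexicon the original split is recovered. With "iti" missing, the candidate list is empty, which is the out-of-vocabulary failure of lexicon-driven approaches noted above.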

Code & Models

Repositories

rsingha108/translist (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding