One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks
Sebastian Nehrdich, Oliver Hellwig, Kurt Keutzer

TL;DR
ByT5-Sanskrit is a byte-level pretrained language model for Sanskrit NLP. It outperforms previous data-driven approaches to word segmentation, matches the best lexicon-based model, and supports joint multitask training for morphologically rich languages.
Contribution
The paper introduces ByT5-Sanskrit, a byte-level pretrained model that sets new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction, and that also improves lemmatization and dependency parsing scores for other morphologically rich languages.
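To make "byte-level" concrete, here is a minimal sketch of how a ByT5-style model maps text to input ids. It assumes the original ByT5 convention (ids 0, 1, 2 reserved for pad, end-of-sequence, and unknown; every UTF-8 byte shifted by 3). Whether ByT5-Sanskrit keeps exactly this vocabulary is an assumption, and the helper names `encode_bytes` and `decode_bytes` are hypothetical.

```python
# Sketch of ByT5-style byte-level encoding.
# Assumption: ids 0/1/2 are pad/eos/unk and each UTF-8 byte is shifted by 3,
# as in the original ByT5 vocabulary; not confirmed for ByT5-Sanskrit itself.
SPECIAL_OFFSET = 3
EOS_ID = 1


def encode_bytes(text: str) -> list[int]:
    """Map a string to byte-level ids and append the end-of-sequence id."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping the reserved special ids."""
    raw = bytes(i - SPECIAL_OFFSET for i in ids if i >= SPECIAL_OFFSET)
    return raw.decode("utf-8", errors="ignore")


sample = "dharmakṣetre"  # IAST transliteration; "ṣ" takes three UTF-8 bytes
ids = encode_bytes(sample)
print(len(sample), len(ids))  # characters vs. ids (bytes plus one EOS)
print(decode_bytes(ids))
```

Because the input is raw bytes, no subword vocabulary has to cover Sanskrit diacritics or fused word forms; the cost is longer sequences, since one character with a diacritic can occupy several positions.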
Findings
Outperforms previous data-driven approaches in Sanskrit word segmentation
Achieves new state-of-the-art in Vedic Sanskrit dependency parsing and OCR post-correction
Yields improved scores for lemmatization and dependency parsing in other morphologically rich languages
Abstract
Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune…
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Text Readability and Simplification
